Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
Acknowledgments: The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often-cited Boston Housing dataset.
# This is the Ames Housing project. The goal is to predict residential house prices in Ames, Iowa.
# We have 79 explanatory variables to use as predictors.
# Load all necessary modules
import pandas as pd # pandas module
import numpy as np # numpy module
import matplotlib.pyplot as plt # matplotlib module
import seaborn as sns # seaborn module
# load the data
ameHouse = pd.read_csv("train.csv")
ameHouse_test = pd.read_csv("test.csv")
ameHouse.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
ameHouse_test.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
| 1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
| 2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
| 3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
| 4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |
5 rows × 80 columns
# check for NA values
null_ames = [col for col in ameHouse.columns if ameHouse[col].isnull().sum() > 0]
ameHouse[null_ames].isnull().sum().sort_values(ascending=False)/len(ameHouse)
PoolQC          0.995205
MiscFeature     0.963014
Alley           0.937671
Fence           0.807534
FireplaceQu     0.472603
LotFrontage     0.177397
GarageType      0.055479
GarageYrBlt     0.055479
GarageFinish    0.055479
GarageQual      0.055479
GarageCond      0.055479
BsmtExposure    0.026027
BsmtFinType2    0.026027
BsmtFinType1    0.025342
BsmtCond        0.025342
BsmtQual        0.025342
MasVnrArea      0.005479
MasVnrType      0.005479
Electrical      0.000685
dtype: float64
# There are 4 features with a missing-value ratio above 50%:
# PoolQC, MiscFeature, Alley, and Fence
# We need to decide what to do with each of these features.
# PoolQC --> Pool Quality
# MiscFeature ---> Miscellaneous feature not covered in other categories
# Alley ---> Type of alley access
# Fence ---> Fence quality
# FireplaceQu ---> Fireplace Quality
These five features with high missing ratios describe amenities that are rare in a typical residential house, so the large number of missing values is unsurprising. Given that, we can simply drop these features.
# Drop features with missing value more than 50%
ameHouse = ameHouse.drop(['PoolQC','MiscFeature','Alley','Fence','FireplaceQu'],axis=1)
ameHouse.shape # the number of columns is now 76
(1460, 76)
# Don't forget to drop the same features from test dataset
ameHouse_test = ameHouse_test.drop(['PoolQC','MiscFeature','Alley','Fence','FireplaceQu'], axis=1)
ameHouse_test.shape
(1459, 75)
Why don't we drop 'LotFrontage' as well? Its missing ratio is relatively tolerable, and based on the documentation there is a good chance this feature is usable for the model.
'LotFrontage' definition: Linear feet of street connected to property
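Since LotFrontage is kept, its 259 missing values will eventually need imputing. One common option is neighborhood-median imputation, on the idea that lots in the same neighborhood tend to have similar frontage. A minimal sketch on a synthetic toy frame (these values are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for ameHouse (synthetic values)
df = pd.DataFrame({
    'Neighborhood': ['NAmes', 'NAmes', 'NAmes', 'CollgCr', 'CollgCr'],
    'LotFrontage': [60.0, np.nan, 70.0, 80.0, np.nan],
})

# Fill each missing LotFrontage with the median of its own neighborhood
df['LotFrontage'] = (df.groupby('Neighborhood')['LotFrontage']
                       .transform(lambda s: s.fillna(s.median())))
print(df['LotFrontage'].tolist())  # [60.0, 65.0, 70.0, 80.0, 80.0]
```

The same `groupby('Neighborhood')['LotFrontage'].transform(...)` pattern applies to `ameHouse` directly.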
# Data summary for numerical features
ameHouse.describe()
| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.000000 | 1460.000000 | 1201.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1452.000000 | 1460.000000 | ... | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 |
| mean | 730.500000 | 56.897260 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | ... | 94.244521 | 46.660274 | 21.954110 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.195890 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 1.000000 | 20.000000 | 21.000000 | 1300.000000 | 1.000000 | 1.000000 | 1872.000000 | 1950.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 2006.000000 | 34900.000000 |
| 25% | 365.750000 | 20.000000 | 59.000000 | 7553.500000 | 5.000000 | 5.000000 | 1954.000000 | 1967.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000 | 2007.000000 | 129975.000000 |
| 50% | 730.500000 | 50.000000 | 69.000000 | 9478.500000 | 6.000000 | 5.000000 | 1973.000000 | 1994.000000 | 0.000000 | 383.500000 | ... | 0.000000 | 25.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 2008.000000 | 163000.000000 |
| 75% | 1095.250000 | 70.000000 | 80.000000 | 11601.500000 | 7.000000 | 6.000000 | 2000.000000 | 2004.000000 | 166.000000 | 712.250000 | ... | 168.000000 | 68.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.000000 | 2009.000000 | 214000.000000 |
| max | 1460.000000 | 190.000000 | 313.000000 | 215245.000000 | 10.000000 | 9.000000 | 2010.000000 | 2010.000000 | 1600.000000 | 5644.000000 | ... | 857.000000 | 547.000000 | 552.000000 | 508.000000 | 480.000000 | 738.000000 | 15500.000000 | 12.000000 | 2010.000000 | 755000.000000 |
8 rows × 38 columns
# Exploring data distribution
ame_05_quan = ameHouse.quantile(0.05, numeric_only=True)
ame_25_quan = ameHouse.quantile(0.25, numeric_only=True)
ame_50_quan = ameHouse.quantile(0.50, numeric_only=True) # same as the median
ame_75_quan = ameHouse.quantile(0.75, numeric_only=True)
ame_95_quan = ameHouse.quantile(0.95, numeric_only=True)
quantile_ame = pd.DataFrame({'ame_05':ame_05_quan,
'ame_25':ame_25_quan,
'ame_50':ame_50_quan,
'ame_75':ame_75_quan,
'ame_95':ame_95_quan})
quantile_ame
| | ame_05 | ame_25 | ame_50 | ame_75 | ame_95 |
|---|---|---|---|---|---|
| Id | 73.95 | 365.75 | 730.5 | 1095.25 | 1387.05 |
| MSSubClass | 20.00 | 20.00 | 50.0 | 70.00 | 160.00 |
| LotFrontage | 34.00 | 59.00 | 69.0 | 80.00 | 107.00 |
| LotArea | 3311.70 | 7553.50 | 9478.5 | 11601.50 | 17401.15 |
| OverallQual | 4.00 | 5.00 | 6.0 | 7.00 | 8.00 |
| OverallCond | 4.00 | 5.00 | 5.0 | 6.00 | 8.00 |
| YearBuilt | 1916.00 | 1954.00 | 1973.0 | 2000.00 | 2007.00 |
| YearRemodAdd | 1950.00 | 1967.00 | 1994.0 | 2004.00 | 2007.00 |
| MasVnrArea | 0.00 | 0.00 | 0.0 | 166.00 | 456.00 |
| BsmtFinSF1 | 0.00 | 0.00 | 383.5 | 712.25 | 1274.00 |
| BsmtFinSF2 | 0.00 | 0.00 | 0.0 | 0.00 | 396.20 |
| BsmtUnfSF | 0.00 | 223.00 | 477.5 | 808.00 | 1468.00 |
| TotalBsmtSF | 519.30 | 795.75 | 991.5 | 1298.25 | 1753.00 |
| 1stFlrSF | 672.95 | 882.00 | 1087.0 | 1391.25 | 1831.25 |
| 2ndFlrSF | 0.00 | 0.00 | 0.0 | 728.00 | 1141.05 |
| LowQualFinSF | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 |
| GrLivArea | 848.00 | 1129.50 | 1464.0 | 1776.75 | 2466.10 |
| BsmtFullBath | 0.00 | 0.00 | 0.0 | 1.00 | 1.00 |
| BsmtHalfBath | 0.00 | 0.00 | 0.0 | 0.00 | 1.00 |
| FullBath | 1.00 | 1.00 | 2.0 | 2.00 | 2.00 |
| HalfBath | 0.00 | 0.00 | 0.0 | 1.00 | 1.00 |
| BedroomAbvGr | 2.00 | 2.00 | 3.0 | 3.00 | 4.00 |
| KitchenAbvGr | 1.00 | 1.00 | 1.0 | 1.00 | 1.00 |
| TotRmsAbvGrd | 4.00 | 5.00 | 6.0 | 7.00 | 10.00 |
| Fireplaces | 0.00 | 0.00 | 1.0 | 1.00 | 2.00 |
| GarageYrBlt | 1930.00 | 1961.00 | 1980.0 | 2002.00 | 2007.00 |
| GarageCars | 0.00 | 1.00 | 2.0 | 2.00 | 3.00 |
| GarageArea | 0.00 | 334.50 | 480.0 | 576.00 | 850.10 |
| WoodDeckSF | 0.00 | 0.00 | 0.0 | 168.00 | 335.00 |
| OpenPorchSF | 0.00 | 0.00 | 25.0 | 68.00 | 175.05 |
| EnclosedPorch | 0.00 | 0.00 | 0.0 | 0.00 | 180.15 |
| 3SsnPorch | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 |
| ScreenPorch | 0.00 | 0.00 | 0.0 | 0.00 | 160.00 |
| PoolArea | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 |
| MiscVal | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 |
| MoSold | 2.00 | 5.00 | 6.0 | 8.00 | 11.00 |
| YrSold | 2006.00 | 2007.00 | 2008.0 | 2009.00 | 2010.00 |
| SalePrice | 88000.00 | 129975.00 | 163000.0 | 214000.00 | 326100.00 |
# Let's visualize the SalePrice distribution (since it is our target, there must be something to learn from it)
fig, axSales = plt.subplots(1,2,figsize=(10,4))
sns.histplot(ax=axSales[0],x='SalePrice',data=ameHouse).set_title("Histogram of SalePrice")
sns.boxplot(ax=axSales[1],y = 'SalePrice', data=ameHouse, color='red').set_title("Boxplot of SalePrice")
plt.suptitle("Distribution of SalePrice", fontweight = 'bold')
plt.subplots_adjust(wspace=0.4)
plt.show()
# The plot is a clear indication that outliers exist in SalePrice
# We can formalize this with a normality check using a Q-Q plot
import statsmodels.api as sm
sm.qqplot(ameHouse['SalePrice'],line='45')
plt.show()
The 'SalePrice' data clearly do not follow the 45-degree line, which is a strong indication that they are not normally distributed.
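Since SalePrice is right-skewed rather than normal, a log transform is a common remedy. A small sketch on synthetic log-normal data (not the real prices) showing how the transform reduces skewness:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal by construction)
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(12, 0.4, size=500)))

# Skewness drops sharply after a log transform, pulling the
# distribution much closer to symmetric
raw_skew = prices.skew()
log_skew = np.log(prices).skew()
print(round(raw_skew, 2), round(log_skew, 2))
```

For the model, `np.log1p(ameHouse['SalePrice'])` would be the analogous transform on the real target.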
# All numerical features
ames_numeric=ameHouse.columns[(ameHouse.dtypes=='int64') | (ameHouse.dtypes=='float64')]
MSSubClass identifies the type of dwelling involved in the sale; basically, it tells us the residential house type. The values are discrete codes ranging from 20 up to 190. Here are some definitions of each code:
etc...
This is what I call a "half" numerical feature: it doesn't really represent a measurement, but rather a set of quantified categories.
# Distribution of MSSubClass
fig, axMSClass = plt.subplots(1,2, figsize=(10,4))
sns.histplot(ax=axMSClass[0],x='MSSubClass',data=ameHouse, bins=16).set_title('MSSubClass in Hist')
sns.countplot(ax=axMSClass[1],x='MSSubClass', data=ameHouse).set_title('MSSubClass in Barplot')
plt.show()
The two plots above show that the MSSubClass feature is better treated as a categorical feature.
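If MSSubClass is treated as categorical, its codes should be cast to strings (or a categorical dtype) before encoding, so a model doesn't read them as magnitudes. A minimal sketch on a toy column:

```python
import pandas as pd

# Toy column standing in for ameHouse['MSSubClass']
ms = pd.Series([20, 60, 60, 120, 190], name='MSSubClass')

# The codes are labels, not quantities, so cast to string before one-hot encoding
ms_cat = ms.astype(str)
dummies = pd.get_dummies(ms_cat, prefix='MSSubClass')
print(dummies.shape)  # (5, 4): one indicator column per distinct code
```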
# MSSubClass could be dropped from ames_numeric here; the drop is left commented out for now
#ames_numeric = ames_numeric.drop('MSSubClass')
ames_numeric
Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
'MiscVal', 'MoSold', 'YrSold', 'SalePrice'],
dtype='object')
I've discussed this feature a bit before. From the data documentation, LotFrontage is the linear feet of street connected to the property. According to gimme-shelter, frontage is "the width of the lot, measured at the front part of the lot". In real estate, more frontage means more land, and more land means room for a bigger house. So, logically speaking, this feature should be useful for predicting the selling price.
# For convenience, I create a function called numDEA() that returns a statistical summary of a feature
def numDEA(col, df):
    stat_summary = pd.DataFrame({
        'Mean': round(df[col].mean(), 2),
        'Median': round(df[col].median(), 2),
        'Mode': df[col].mode()[0],
        'std': round(df[col].std(), 2),
        'Min': df[col].min(),
        'Max': df[col].max(),
        'Range': df[col].max() - df[col].min(),
        '5%': df[col].quantile(0.05),
        '25%': df[col].quantile(0.25),
        '50%': df[col].quantile(0.50),
        '75%': df[col].quantile(0.75),
        '90%': df[col].quantile(0.9),
        'IQR': df[col].quantile(0.75) - df[col].quantile(0.25),
        'Count': df[col].count(),
        'Unique': df[col].nunique(),
        'Missing Value': df[col].isnull().sum()
    }, index=[col])
    return stat_summary
# LotFrontage's Statistical Summary
numDEA('LotFrontage',df= ameHouse)
| | Mean | Median | Mode | std | Min | Max | Range | 5% | 25% | 50% | 75% | 90% | IQR | Count | Unique | Missing Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LotFrontage | 70.05 | 69.0 | 60.0 | 24.28 | 21.0 | 313.0 | 292.0 | 34.0 | 59.0 | 69.0 | 80.0 | 96.0 | 21.0 | 1201 | 110 | 259 |
From the central tendency measures, we can deduce that LotFrontage is right-skewed: most residential houses in Ames, Iowa have a frontage of around 69 feet (21.03 meters). There is also a huge gap between the 90th-percentile frontage and the maximum (313 - 96 = 217 feet), which points to a small group of "elite" houses with very wide frontage.
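The gap between the 90th percentile and the maximum can be formalized with Tukey's fences (values beyond Q3 + 1.5·IQR), the same rule a boxplot uses to flag outliers. A sketch on a small hand-made sample (not the real column):

```python
import pandas as pd

# Toy frontage values; 313 mimics the "elite" wide-frontage maximum
s = pd.Series([21, 34, 59, 69, 80, 96, 107, 313], name='LotFrontage')

# Tukey's upper fence: Q3 + 1.5 * IQR
q1, q3 = s.quantile(0.25), s.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)
outliers = s[s > upper]
print(outliers.tolist())  # [313]
```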
# Distribution of LotFrontage
sns.histplot(x='LotFrontage',data=ameHouse).set_title('LotFrontage in Hist')
plt.show()
sns.relplot(x='LotFrontage',y='SalePrice',data=ameHouse)
plt.title("SalePrice and LotFrontage")
plt.show()
The scatterplot above suggests a positive correlation between LotFrontage and SalePrice, albeit with some anomalies. These anomalies hint at other factors influencing SalePrice, which I will investigate further.
From the data documentation, 'LotArea' is the lot size in square feet.
# Statistical Summary of LotArea
numDEA('LotArea',df= ameHouse)
| | Mean | Median | Mode | std | Min | Max | Range | 5% | 25% | 50% | 75% | 90% | IQR | Count | Unique | Missing Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LotArea | 10516.83 | 9478.5 | 7200 | 9981.26 | 1300 | 215245 | 213945 | 3311.7 | 7553.5 | 9478.5 | 11601.5 | 14381.7 | 4048.0 | 1460 | 1073 | 0 |
# LotArea Visualization
fig_LotArea, axLotArea = plt.subplots(1,2, figsize=(15,5))
sns.histplot(ax=axLotArea[0], x= 'LotArea',data=ameHouse).set_title("LotArea Distribution")
sns.scatterplot(ax=axLotArea[1],x='LotArea',y='SalePrice',data=ameHouse).set_title("SalePrice vs LotArea")
plt.suptitle("LotArea Distribution and Relationship", fontweight="bold")
plt.show()
Like LotFrontage, LotArea is right-skewed. The majority of houses in Ames, Iowa are small to medium, while a small fraction have a LotArea from about 14,000 square feet (~1,300 m²) up to 215,000 square feet (~20,000 m²). Because of this scale, the scatterplot cannot depict the relationship between LotArea and SalePrice very well. To fix this, I transform both features with the natural log; after the transformation, the pattern starts to show.
viz_only_lotArea = pd.DataFrame({
'nl_SalePrice':np.log(ameHouse['SalePrice']),
'nl_LotArea':np.log(ameHouse['LotArea'])
})
sns.scatterplot(x='nl_LotArea',y='nl_SalePrice',data=viz_only_lotArea).set_title("Natural Log SalePrice vs Natural Log LotArea")
plt.show()
Based on the data documentation, 'OverallQual' rates the overall material and finish of the house. The values represent a quality scale:
10 ---> Very Excellent
9 ---> Excellent
8 ---> Very Good
7 ---> Good
6 ---> Above Average
5 ---> Average
4 ---> Below Average
3 ---> Fair
2 ---> Poor
1 ---> Very Poor
Because it takes a fixed set of values, we should treat 'OverallQual' as a categorical variable, specifically at the ordinal level of measurement.
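One way to encode this is a pandas ordered categorical, which keeps the 1-10 ranking while marking the column as categorical. A sketch with hypothetical values:

```python
import pandas as pd

# Toy OverallQual values; the ordered categorical preserves the 1-10 ranking
qual = pd.Series([5, 7, 10, 3], name='OverallQual')
qual_ord = pd.Categorical(qual, categories=range(1, 11), ordered=True)

# Ordering is preserved, so comparisons like "at least Good (7)" still work
at_least_good = qual_ord >= 7
print(list(at_least_good))  # [False, True, True, False]
```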
# Drop OverallQual from ames_numeric
ames_numeric =ames_numeric.drop('OverallQual')
ames_numeric
Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallCond',
'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces',
'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
'MoSold', 'YrSold', 'SalePrice'],
dtype='object')
# OverallQual Exploration
# Frequency
countplot_overall = pd.DataFrame({
'Count': ameHouse['OverallQual'].value_counts(),
'Percent(%)':round((ameHouse['OverallQual'].value_counts()/len(ameHouse['OverallQual']))*100,2)
})
print("The mode of this feature is : {}".format(ameHouse['OverallQual'].mode()[0]))
print(countplot_overall)
The mode of this feature is : 5
Count Percent(%)
5 397 27.19
6 374 25.62
7 319 21.85
8 168 11.51
4 116 7.95
9 43 2.95
3 20 1.37
10 18 1.23
2 3 0.21
1 2 0.14
sns.countplot(x='OverallQual', data=ameHouse).set_title('OverallQual')
plt.show()
About half of the residential housing in Ames, Iowa (52.81%) has average to above-average quality, with only 9.67% rated below average to very poor. Based on this, I think it is safe to conclude that Ames, Iowa is a good place to live.
# SalePrice and OverallQual
sns.boxplot(x= 'OverallQual',y='SalePrice', data=ameHouse)
plt.title("OverallQual vs SalePrice")
plt.show()
OverallQual divides the SalePrice distribution rather clearly. Each quality level covers a distinctive range of SalePrice, which indicates that OverallQual is a good feature for predicting SalePrice.
From the data documentation, OverallCond rates the overall condition of the house. It is more or less similar to OverallQual, and it uses the same set of values.
# let's drop it from ames_numeric
ames_numeric = ames_numeric.drop('OverallCond')
ames_numeric
Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'YearBuilt',
'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt',
'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
'MoSold', 'YrSold', 'SalePrice'],
dtype='object')
# OverallCond Data Exploration
allCond_count = pd.DataFrame({
'Count': ameHouse['OverallCond'].value_counts(),
'Percent(%)':round((ameHouse['OverallCond'].value_counts()/len(ameHouse['OverallCond'])*100),2)
})
print("The mode of this feature is : {}".format(ameHouse['OverallCond'].mode()[0]))
print(allCond_count)
The mode of this feature is : 5
   Count  Percent(%)
5    821       56.23
6    252       17.26
7    205       14.04
8     72        4.93
4     57        3.90
3     25        1.71
9     22        1.51
2      5        0.34
1      1        0.07
# OverallCond Visualization
figCond, axCond = plt.subplots(1,2, figsize=(15,5))
sns.countplot(ax=axCond[0],x='OverallCond',data=ameHouse).set_title("Frequency of OverallCond")
sns.boxplot(ax=axCond[1],x='OverallCond',y='SalePrice',data=ameHouse).set_title("OverallCond vs SalePrice")
plt.show()
Unlike OverallQual, which divides SalePrice relatively well, OverallCond shows a more varied relationship with SalePrice. The boxplot shows that houses in "average" condition span a wide range of SalePrice. Moreover, there is an anomaly where houses in "poor" condition sell for more than houses in better condition. This suggests either a weak correlation between the two features, or a feature with a stronger effect at play.
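One way to quantify the suspected weak correlation is Spearman's rank correlation, which suits an ordinal grade like OverallCond better than Pearson's. A sketch on made-up numbers (not the dataset):

```python
import pandas as pd

# Hypothetical grade/price pairs, for illustration only
toy = pd.DataFrame({
    'OverallCond': [5, 5, 6, 7, 8, 3],
    'SalePrice':   [150000, 210000, 160000, 180000, 175000, 190000],
})

# Spearman correlates the *ranks*, so the 1-9 grade scale is respected
rho = toy['OverallCond'].corr(toy['SalePrice'], method='spearman')
```

On the real data, `ameHouse['OverallCond'].corr(ameHouse['SalePrice'], method='spearman')` gives the analogous check.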
From the data documentation, YearBuilt is the original construction date, and YearRemodAdd is the remodel date (same as the construction date if there was no remodeling or addition). The values are only specified as years, so there is not much to do except ensure there are no null values or entry errors.
print('''
Missing value :
{}'''.format(ameHouse[['YearBuilt','YearRemodAdd']].isnull().sum()))
Missing value :
YearBuilt       0
YearRemodAdd    0
dtype: int64
numDEA('MasVnrArea',df= ameHouse)
| | Mean | Median | Mode | std | Min | Max | Range | 5% | 25% | 50% | 75% | 90% | IQR | Count | Unique | Missing Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MasVnrArea | 103.69 | 0.0 | 0.0 | 181.07 | 0.0 | 1600.0 | 1600.0 | 0.0 | 0.0 | 0.0 | 166.0 | 335.0 | 166.0 | 1452 | 327 | 8 |
# MasVnrArea Visualization
figMasVn, axMasVn = plt.subplots(1,2, figsize=(15,5))
sns.histplot(ax=axMasVn[0], x= 'MasVnrArea',data=ameHouse).set_title("MasVnrArea Distribution")
sns.scatterplot(ax=axMasVn[1],x='MasVnrArea',y='SalePrice',data=ameHouse).set_title("SalePrice vs MasVnrArea")
plt.suptitle("MasVnrArea Distribution and Relationship", fontweight="bold")
plt.show()
This feature is a little tricky. At a glance, almost all houses in Ames, Iowa have zero square feet of masonry veneer. It is unclear whether they simply have no masonry veneer, or whether this is due to data-entry errors.
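One way to probe whether the zeros are genuine is to cross-check MasVnrArea against the categorical MasVnrType: a zero area paired with a type other than 'None' would suggest an entry error. A sketch on toy rows (not the real data):

```python
import pandas as pd

# Toy rows pairing veneer type with veneer area
toy = pd.DataFrame({
    'MasVnrType': ['None', 'BrkFace', 'None', 'Stone'],
    'MasVnrArea': [0.0, 250.0, 0.0, 120.0],
})

# Count rows where area is zero but the type claims a veneer exists
mismatch = ((toy['MasVnrArea'] == 0) & (toy['MasVnrType'] != 'None')).sum()
print(mismatch)  # 0 -> zeros here are consistent "no veneer" houses
```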
From the data documentation, BsmtFinSF1 is "Type 1 finished square feet" (the finished basement area of a given type), and BsmtFinSF2 is "Type 2 finished square feet". The two features measure the same thing; the second is only non-zero when a house has multiple basement finish types.
print(numDEA('BsmtFinSF1',df=ameHouse))
print(numDEA('BsmtFinSF2',df=ameHouse))
Mean Median Mode std Min Max Range 5% 25% 50% \
BsmtFinSF1 443.64 383.5 0 456.1 0 5644 5644 0.0 0.0 383.5
75% 90% IQR Count Unique Missing Value
BsmtFinSF1 712.25 1065.5 712.25 1460 637 0
Mean Median Mode std Min Max Range 5% 25% 50% 75% \
BsmtFinSF2 46.55 0.0 0 161.32 0 1474 1474 0.0 0.0 0.0 0.0
90% IQR Count Unique Missing Value
BsmtFinSF2 117.2 0.0 1460 144 0
# BsmtFinSF1 Visualization
figBsmt1, axBsmt1 = plt.subplots(2,2, figsize=(15,10))
sns.histplot(ax=axBsmt1[0,0], x= 'BsmtFinSF1',data=ameHouse).set_title("BsmtFinSF1 Distribution")
sns.scatterplot(ax=axBsmt1[0,1],x='BsmtFinSF1',y='SalePrice',data=ameHouse).set_title("SalePrice vs BsmtFinSF1")
sns.histplot(ax=axBsmt1[1,0], x= 'BsmtFinSF2',data=ameHouse, color="red").set_title("BsmtFinSF2 Distribution")
sns.scatterplot(ax=axBsmt1[1,1],x='BsmtFinSF2',y='SalePrice',data=ameHouse,color="red").set_title("SalePrice vs BsmtFinSF2")
plt.suptitle("BsmtFinSF1 Distribution and Relationship", fontweight="bold")
plt.subplots_adjust(wspace=0.4)
plt.show()
If we ignore observations with zero basement area, there is a distinct pattern indicating a positive relationship between BsmtFinSF1 and SalePrice. BsmtFinSF2, however, shows no clear relationship with SalePrice; it also has a smaller value range, with more than 90% of its values equal to 0.
From the data documentation, BsmtUnfSF is the unfinished square feet of basement area.
numDEA('BsmtUnfSF',ameHouse)
| | Mean | Median | Mode | std | Min | Max | Range | 5% | 25% | 50% | 75% | 90% | IQR | Count | Unique | Missing Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BsmtUnfSF | 567.24 | 477.5 | 0 | 441.87 | 0 | 2336 | 2336 | 0.0 | 223.0 | 477.5 | 808.0 | 1232.0 | 585.0 | 1460 | 780 | 0 |
figBsmtUnf, axBsmtUnf = plt.subplots(1,2, figsize=(15,5))
sns.histplot(ax=axBsmtUnf[0], x= 'BsmtUnfSF',data=ameHouse).set_title("BsmtUnfSF Distribution")
sns.scatterplot(ax=axBsmtUnf[1],x='BsmtUnfSF',y='SalePrice',data=ameHouse).set_title("SalePrice vs BsmtUnfSF")
Text(0.5, 1.0, 'SalePrice vs BsmtUnfSF')
There is no clear pattern between BsmtUnfSF and SalePrice.
# There are quite a lot of variables here, so let's speed things up
# Within ames_numeric, we identify features with low cardinality, check them against the documentation,
# and decide whether they are numeric or ordinal/categorical variables
# "low cardinality" here is arbitrary: 15 or fewer unique values
ames_numeric_low_car = [col for col in ames_numeric if ameHouse[col].nunique() <= 15]
ameHouse[ames_numeric_low_car].nunique()
MSSubClass      15
BsmtFullBath     4
BsmtHalfBath     3
FullBath         4
HalfBath         3
BedroomAbvGr     8
KitchenAbvGr     4
TotRmsAbvGrd    12
Fireplaces       4
GarageCars       5
PoolArea         8
MoSold          12
YrSold           5
dtype: int64
figMisc, axMisc = plt.subplots(2,3, figsize=(15,10))
sns.boxplot(ax=axMisc[0,0],x='MSSubClass',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axMisc[0,1],x='BsmtFullBath',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axMisc[0,2],x='BsmtHalfBath',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axMisc[1,0],x='FullBath',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axMisc[1,1],x='HalfBath',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axMisc[1,2],x='Fireplaces',y="SalePrice",data=ameHouse)
plt.subplots_adjust(wspace=0.4)
plt.show()
Based on the feature definitions, all of these low-cardinality features (except MSSubClass) are discrete numeric values: they count facilities that exist in a house. But because of the nature of their values, the proper treatment is to group them with the categorical features.
# Separate high cardinality numeric feature from low cardinality numeric feature
ames_numeric_high_car = [col for col in ames_numeric if col not in ames_numeric_low_car]
ames_numeric_high_car
['Id', 'LotFrontage', 'LotArea', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageYrBlt', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'MiscVal', 'SalePrice']
# Print the statistical summary for each high-cardinality numeric feature
for i in ames_numeric_high_car:
print(numDEA(i,df=ameHouse),end="\n\n")
Mean Median Mode std Min Max Range 5% 25% 50% \
Id 730.5 730.5 1 421.61 1 1460 1459 73.95 365.75 730.5
75% 90% IQR Count Unique Missing Value
Id 1095.25 1314.1 729.5 1460 1460 0
Mean Median Mode std Min Max Range 5% 25% 50% \
LotFrontage 70.05 69.0 60.0 24.28 21.0 313.0 292.0 34.0 59.0 69.0
75% 90% IQR Count Unique Missing Value
LotFrontage 80.0 96.0 21.0 1201 110 259
Mean Median Mode std Min Max Range 5% \
LotArea 10516.83 9478.5 7200 9981.26 1300 215245 213945 3311.7
25% 50% 75% 90% IQR Count Unique \
LotArea 7553.5 9478.5 11601.5 14381.7 4048.0 1460 1073
Missing Value
LotArea 0
Mean Median Mode std Min Max Range 5% 25% \
YearBuilt 1971.27 1973.0 2006 30.2 1872 2010 138 1916.0 1954.0
50% 75% 90% IQR Count Unique Missing Value
YearBuilt 1973.0 2000.0 2006.0 46.0 1460 112 0
Mean Median Mode std Min Max Range 5% 25% \
YearRemodAdd 1984.87 1994.0 1950 20.65 1950 2010 60 1950.0 1967.0
50% 75% 90% IQR Count Unique Missing Value
YearRemodAdd 1994.0 2004.0 2006.0 37.0 1460 61 0
Mean Median Mode std Min Max Range 5% 25% 50% \
MasVnrArea 103.69 0.0 0.0 181.07 0.0 1600.0 1600.0 0.0 0.0 0.0
75% 90% IQR Count Unique Missing Value
MasVnrArea 166.0 335.0 166.0 1452 327 8
Mean Median Mode std Min Max Range 5% 25% 50% \
BsmtFinSF1 443.64 383.5 0 456.1 0 5644 5644 0.0 0.0 383.5
75% 90% IQR Count Unique Missing Value
BsmtFinSF1 712.25 1065.5 712.25 1460 637 0
Mean Median Mode std Min Max Range 5% 25% 50% 75% \
BsmtFinSF2 46.55 0.0 0 161.32 0 1474 1474 0.0 0.0 0.0 0.0
90% IQR Count Unique Missing Value
BsmtFinSF2 117.2 0.0 1460 144 0
Mean Median Mode std Min Max Range 5% 25% 50% \
BsmtUnfSF 567.24 477.5 0 441.87 0 2336 2336 0.0 223.0 477.5
75% 90% IQR Count Unique Missing Value
BsmtUnfSF 808.0 1232.0 585.0 1460 780 0
Mean Median Mode std Min Max Range 5% 25% \
TotalBsmtSF 1057.43 991.5 0 438.71 0 6110 6110 519.3 795.75
50% 75% 90% IQR Count Unique Missing Value
TotalBsmtSF 991.5 1298.25 1602.2 502.5 1460 721 0
Mean Median Mode std Min Max Range 5% 25% \
1stFlrSF 1162.63 1087.0 864 386.59 334 4692 4358 672.95 882.0
50% 75% 90% IQR Count Unique Missing Value
1stFlrSF 1087.0 1391.25 1680.0 509.25 1460 753 0
Mean Median Mode std Min Max Range 5% 25% 50% \
2ndFlrSF 346.99 0.0 0 436.53 0 2065 2065 0.0 0.0 0.0
75% 90% IQR Count Unique Missing Value
2ndFlrSF 728.0 954.2 728.0 1460 417 0
Mean Median Mode std Min Max Range 5% 25% 50% 75% \
LowQualFinSF 5.84 0.0 0 48.62 0 572 572 0.0 0.0 0.0 0.0
90% IQR Count Unique Missing Value
LowQualFinSF 0.0 0.0 1460 24 0
Mean Median Mode std Min Max Range 5% 25% \
GrLivArea 1515.46 1464.0 864 525.48 334 5642 5308 848.0 1129.5
50% 75% 90% IQR Count Unique Missing Value
GrLivArea 1464.0 1776.75 2158.3 647.25 1460 861 0
Mean Median Mode std Min Max Range 5% \
GarageYrBlt 1978.51 1980.0 2005.0 24.69 1900.0 2010.0 110.0 1930.0
25% 50% 75% 90% IQR Count Unique \
GarageYrBlt 1961.0 1980.0 2002.0 2006.0 41.0 1379 97
Missing Value
GarageYrBlt 81
Mean Median Mode std Min Max Range 5% 25% 50% \
GarageArea 472.98 480.0 0 213.8 0 1418 1418 0.0 334.5 480.0
75% 90% IQR Count Unique Missing Value
GarageArea 576.0 757.1 241.5 1460 441 0
Mean Median Mode std Min Max Range 5% 25% 50% \
WoodDeckSF 94.24 0.0 0 125.34 0 857 857 0.0 0.0 0.0
75% 90% IQR Count Unique Missing Value
WoodDeckSF 168.0 262.0 168.0 1460 274 0
Mean Median Mode std Min Max Range 5% 25% 50% \
OpenPorchSF 46.66 25.0 0 66.26 0 547 547 0.0 0.0 25.0
75% 90% IQR Count Unique Missing Value
OpenPorchSF 68.0 130.0 68.0 1460 202 0
Mean Median Mode std Min Max Range 5% 25% 50% \
EnclosedPorch 21.95 0.0 0 61.12 0 552 552 0.0 0.0 0.0
75% 90% IQR Count Unique Missing Value
EnclosedPorch 0.0 112.0 0.0 1460 120 0
Mean Median Mode std Min Max Range 5% 25% 50% 75% \
3SsnPorch 3.41 0.0 0 29.32 0 508 508 0.0 0.0 0.0 0.0
90% IQR Count Unique Missing Value
3SsnPorch 0.0 0.0 1460 20 0
Mean Median Mode std Min Max Range 5% 25% 50% 75% \
ScreenPorch 15.06 0.0 0 55.76 0 480 480 0.0 0.0 0.0 0.0
90% IQR Count Unique Missing Value
ScreenPorch 0.0 0.0 1460 76 0
Mean Median Mode std Min Max Range 5% 25% 50% 75% \
MiscVal 43.49 0.0 0 496.12 0 15500 15500 0.0 0.0 0.0 0.0
90% IQR Count Unique Missing Value
MiscVal 0.0 0.0 1460 21 0
Mean Median Mode std Min Max Range \
SalePrice 180921.2 163000.0 140000 79442.5 34900 755000 720100
5% 25% 50% 75% 90% IQR Count \
SalePrice 88000.0 129975.0 163000.0 214000.0 278000.0 84025.0 1460
Unique Missing Value
SalePrice 663 0
From this summary, I find that even among these high-cardinality features, a few still have a high percentage of zero values: 'MasVnrArea', 'BsmtFinSF2', '2ndFlrSF', 'WoodDeckSF', and 'ScreenPorch'.
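The zero share per column can be computed in one pass rather than read off each summary. A sketch on a toy frame (the 0.5 threshold and the values are illustrative):

```python
import pandas as pd

# Toy numeric frame; WoodDeckSF mimics a rarely-present amenity
toy = pd.DataFrame({
    'WoodDeckSF': [0, 0, 120, 0],
    'GrLivArea':  [900, 1200, 1500, 1100],
})

# Fraction of zeros per column; a high share flags rare amenities
zero_share = (toy == 0).mean()
mostly_zero = zero_share[zero_share > 0.5].index.tolist()
print(mostly_zero)  # ['WoodDeckSF']
```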
# Let's check whether these features are still worth keeping
figMisc2, axMisc2 = plt.subplots(2,3, figsize=(15,10))
sns.regplot(ax=axMisc2[0,0],x='MasVnrArea',y="SalePrice",data=ameHouse)
sns.regplot(ax=axMisc2[0,1],x='BsmtFinSF2',y="SalePrice",data=ameHouse)
sns.regplot(ax=axMisc2[0,2],x='2ndFlrSF',y="SalePrice",data=ameHouse)
sns.regplot(ax=axMisc2[1,0],x='WoodDeckSF',y="SalePrice",data=ameHouse)
sns.regplot(ax=axMisc2[1,1],x='ScreenPorch',y="SalePrice",data=ameHouse)
plt.subplots_adjust(wspace=0.4)
plt.show()
Aside from 'MasVnrArea', none of these features seem worth keeping. The rarity itself is the problem. Looking back, every feature with a high percentage of zeros is zero because of rarity; this ties back to the earlier observation about a small group of elite houses. It makes sense that not every house has a pool, a wood deck, or a three-season porch, right?
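Instead of dropping a mostly-zero feature outright, an alternative worth considering is collapsing it into a has/has-not flag, which preserves the "elite amenity" signal without the sparse raw values. A sketch on toy values:

```python
import pandas as pd

# Toy mostly-zero amenity column
toy = pd.DataFrame({'WoodDeckSF': [0, 0, 250, 0, 120]})

# Binary flag: does the house have a wood deck at all?
toy['HasWoodDeck'] = (toy['WoodDeckSF'] > 0).astype(int)
print(toy['HasWoodDeck'].tolist())  # [0, 0, 1, 0, 1]
```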
# Keep all remaining features as numeric; the MSSubClass drop stays commented out for now
#ames_numeric = ames_numeric.drop('MSSubClass')
len(ames_numeric)
36
# Analysis on Categorical Feature
ames_cat = [col for col in ameHouse.columns if ameHouse[col].dtypes=='object']
print(len(ames_cat))
ameHouse[ames_cat].nunique()
38
MSZoning          5
Street            2
LotShape          4
LandContour       4
Utilities         2
LotConfig         5
LandSlope         3
Neighborhood     25
Condition1        9
Condition2        8
BldgType          5
HouseStyle        8
RoofStyle         6
RoofMatl          8
Exterior1st      15
Exterior2nd      16
MasVnrType        4
ExterQual         4
ExterCond         5
Foundation        6
BsmtQual          4
BsmtCond          4
BsmtExposure      4
BsmtFinType1      6
BsmtFinType2      6
Heating           6
HeatingQC         5
CentralAir        2
Electrical        5
KitchenQual       4
Functional        7
GarageType        6
GarageFinish      3
GarageQual        5
GarageCond        5
PavedDrive        3
SaleType          9
SaleCondition     6
dtype: int64
# Check missing value
ameHouse[ames_cat].isnull().sum()
MSZoning          0
Street            0
LotShape          0
LandContour       0
Utilities         0
LotConfig         0
LandSlope         0
Neighborhood      0
Condition1        0
Condition2        0
BldgType          0
HouseStyle        0
RoofStyle         0
RoofMatl          0
Exterior1st       0
Exterior2nd       0
MasVnrType        8
ExterQual         0
ExterCond         0
Foundation        0
BsmtQual         37
BsmtCond         37
BsmtExposure     38
BsmtFinType1     37
BsmtFinType2     38
Heating           0
HeatingQC         0
CentralAir        0
Electrical        1
KitchenQual       0
Functional        0
GarageType       81
GarageFinish     81
GarageQual       81
GarageCond       81
PavedDrive        0
SaleType          0
SaleCondition     0
dtype: int64
Here I have 38 features treated as categorical. At a glance, all missing values are caused by natural conditions (no garage, no basement, etc.). I build some visualizations to make this more presentable.
figCat, axCat = plt.subplots(3,3, figsize=(15,15))
sns.countplot(ax=axCat[0,0],x='MSZoning',data=ameHouse)
sns.countplot(ax=axCat[0,1],x='Street',data=ameHouse)
sns.countplot(ax=axCat[0,2],x='LotShape',data=ameHouse)
sns.countplot(ax=axCat[1,0],x='LandContour',data=ameHouse)
sns.countplot(ax=axCat[1,1],x='Utilities',data=ameHouse)
sns.countplot(ax=axCat[1,2],x='LotConfig',data=ameHouse)
sns.countplot(ax=axCat[2,0],x='LandSlope',data=ameHouse)
sns.countplot(ax=axCat[2,1],x='Neighborhood',data=ameHouse)
sns.countplot(ax=axCat[2,2],x='Condition1',data=ameHouse)
plt.subplots_adjust(wspace=0.4)
plt.show()
figCat, axCat = plt.subplots(3,3, figsize=(15,15))
sns.boxplot(ax=axCat[0,0],x='MSZoning',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axCat[0,1],x='Street',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axCat[0,2],x='LotShape',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axCat[1,0],x='LandContour',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axCat[1,1],x='Utilities',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axCat[1,2],x='LotConfig',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axCat[2,0],x='LandSlope',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axCat[2,1],x='Neighborhood',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axCat[2,2],x='Condition1',y="SalePrice",data=ameHouse)
plt.subplots_adjust(wspace=0.4)
plt.show()
Judging from the distributions and per-category counts, almost all nine visualized categoricals have unbalanced proportions, dominated by a certain group.
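The imbalance can be quantified instead of eyeballed: the share taken by the single most frequent level of each categorical feature. A sketch on toy data (the column names only mirror the dataset):

```python
import pandas as pd

def top_category_share(df: pd.DataFrame) -> pd.Series:
    """Share of the most frequent level in each object column."""
    cats = df.select_dtypes("object")
    # value_counts sorts descending, so .iloc[0] is the dominant level's share
    return cats.apply(lambda s: s.value_counts(normalize=True).iloc[0])

toy = pd.DataFrame({
    "Street": ["Pave"] * 9 + ["Grvl"],
    "LotShape": ["Reg"] * 6 + ["IR1"] * 4,
})
print(top_category_share(toy))  # Street 0.9, LotShape 0.6
```

A feature where this share approaches 1.0 (like Street or Utilities here) carries almost no information for the model.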
### Preprocessing Data
# As we start preprocessing the data, it's time to split the training dataset
# Import train_test_split module
from sklearn.model_selection import train_test_split
# separate target variable from the rest
# in this project, the target is SalePrice
X = ameHouse.copy(deep=True)
y = X.pop("SalePrice")
X.head()
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | LotShape | LandContour | Utilities | LotConfig | ... | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | Reg | Lvl | AllPub | Inside | ... | 61 | 0 | 0 | 0 | 0 | 0 | 2 | 2008 | WD | Normal |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | Reg | Lvl | AllPub | FR2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2007 | WD | Normal |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | IR1 | Lvl | AllPub | Inside | ... | 42 | 0 | 0 | 0 | 0 | 0 | 9 | 2008 | WD | Normal |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | IR1 | Lvl | AllPub | Corner | ... | 35 | 272 | 0 | 0 | 0 | 0 | 2 | 2006 | WD | Abnorml |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | IR1 | Lvl | AllPub | FR2 | ... | 84 | 0 | 0 | 0 | 0 | 0 | 12 | 2008 | WD | Normal |
5 rows × 75 columns
y.head()
0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64
train_X, val_X, train_y, val_y = train_test_split(X,y,train_size=0.8,test_size=0.2, random_state=0)
print(train_X.shape)
print(val_X.shape)
print(train_y.shape)
print(val_y.shape)
(1168, 75)
(292, 75)
(1168,)
(292,)
As we know from the previous EDA results, some features contain missing values. A few have already been dropped because of their high percentage of missing values. What's left is to deal with the remaining ones.
It would be easy to get rid of entire rows with missing values, but I don't want to risk losing too much important information.
Instead, I will reason about each feature with missing values to find the cause. If the cause is natural, I will remove the row or replace the value with zero. If, on the other hand, the missing value is caused by an error, I will impute it.
# Recall all the missing
miss_col = [col for col in train_X.columns if train_X[col].isnull().sum()>0]
print(len(miss_col))
train_X[miss_col].isnull().sum()
14
LotFrontage     212
MasVnrType        6
MasVnrArea        6
BsmtQual         28
BsmtCond         28
BsmtExposure     28
BsmtFinType1     28
BsmtFinType2     29
Electrical        1
GarageType       58
GarageYrBlt      58
GarageFinish     58
GarageQual       58
GarageCond       58
dtype: int64
# There are 14 features containing missing values.
# Let's examine each of them
# LotFrontage, as stated before, is the linear feet of street connected to the property
# First, let's think about it. Is it possible for a residential house to have no street or alley in front of it?
# While that is still possible in a rural area, it does not seem to be the case in this dataset.
# To prove it, there are a few features we can use.
# I extract all index with null value of LotFrontage
miss_lotFrontage = train_X[train_X['LotFrontage'].isnull()].index.tolist()
# Street == Type of road access to property
# This feature consist of two categories, ie 'Gravel' and 'Paved'
sns.countplot(x=train_X.loc[miss_lotFrontage,'Street'])
plt.title("Street Access of house with null value of LotFrontage", fontweight = "bold")
plt.show()
print("Missing value :{}".format(train_X['Street'].isnull().sum()))
# No missing values, which means every house in our training dataset has road access
# Most houses with a null LotFrontage are connected to a paved road, so a genuinely absent frontage makes no sense.
Missing value :0
# Now I know that the missing values in LotFrontage are caused by an error. The next step is imputation to fill these
# empty values
# For this case, I use the median LotFrontage grouped by MSZoning and MSSubClass. This is an arbitrary decision.
median_lotFrontage_by_ZoningClass = train_X.groupby(['MSZoning','MSSubClass'])['LotFrontage'].median()
train_X.groupby(['MSZoning','MSSubClass'])['LotFrontage'].median()
# Start imputation
MSZoning MSSubClass
C (all) 20 58.0
30 90.0
50 66.0
70 50.0
190 60.0
FV 20 73.5
60 75.0
120 50.0
160 30.0
RH 20 60.0
30 70.0
45 60.0
50 55.0
70 54.5
90 82.0
120 34.0
190 60.0
RL 20 75.0
30 60.0
40 60.0
45 57.0
50 60.0
60 79.0
70 66.0
75 60.0
80 80.0
85 73.0
90 71.0
120 44.0
160 36.0
190 71.5
RM 20 70.0
30 56.0
40 40.0
45 53.0
50 52.0
60 60.0
70 60.0
75 70.0
90 68.0
120 40.5
160 21.0
180 21.0
190 60.0
Name: LotFrontage, dtype: float64
# Imputation
# Python may give a warning like this:
# SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
# Try using .loc[row_indexer,col_indexer] = value instead
# Silence it with the following line (default='warn'):
# pd.options.mode.chained_assignment = None
pd.options.mode.chained_assignment = None
miss_lotFrontage = train_X[train_X['LotFrontage'].isnull()].index.tolist()
for j in miss_lotFrontage:
    for i in range(len(median_lotFrontage_by_ZoningClass)):
        if (train_X.loc[j,'MSZoning'] == median_lotFrontage_by_ZoningClass.index[i][0]
                and train_X.loc[j,'MSSubClass'] == median_lotFrontage_by_ZoningClass.index[i][1]):
            train_X.loc[j,'LotFrontage'] = median_lotFrontage_by_ZoningClass.iloc[i]
# Imputation Success
train_X.loc[miss_lotFrontage,'LotFrontage']
817 75.0
1164 80.0
186 80.0
1441 40.5
465 40.5
...
845 73.0
537 75.0
1033 75.0
1383 60.0
559 44.0
Name: LotFrontage, Length: 212, dtype: float64
train_X['LotFrontage'].isnull().sum() # no more missing value
0
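The nested loop above works, but pandas can do the same group-wise median fill in a single vectorized pass with `groupby(...).transform('median')`. A sketch on toy data (not the actual dataset values):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    "MSZoning":    ["RL", "RL", "RL", "RM", "RM"],
    "MSSubClass":  [20, 20, 20, 60, 60],
    "LotFrontage": [70.0, 80.0, np.nan, 40.0, np.nan],
})

# Fill each missing LotFrontage with the median of its (MSZoning, MSSubClass) group
group_median = toy.groupby(["MSZoning", "MSSubClass"])["LotFrontage"].transform("median")
toy["LotFrontage"] = toy["LotFrontage"].fillna(group_median)
print(toy["LotFrontage"].tolist())  # [70.0, 80.0, 75.0, 40.0, 40.0]
```

`transform` broadcasts the group statistic back to the original row index, so no explicit index matching is needed.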
# Don't forget to do the same thing to validation and test set
print(val_X['LotFrontage'].isnull().sum()) # 47 missing values in validation dataset
print(ameHouse_test['LotFrontage'].isnull().sum()) # 227 missing values in test dataset
47
227
Valid_median_lotFrontage_by_ZoningClass = val_X.groupby(['MSZoning','MSSubClass'])['LotFrontage'].median()
Test_median_lotFrontage_by_ZoningClass = ameHouse_test.groupby(['MSZoning','MSSubClass'])['LotFrontage'].median()
Valid_median_lotFrontage_by_ZoningClass
# There is a null median value in one validation-dataset group. I replace it with the median of the same group in the test dataset
Valid_median_lotFrontage_by_ZoningClass[5] = 60
Valid_median_lotFrontage_by_ZoningClass
MSZoning MSSubClass
C (all) 50 60.0
FV 20 78.0
60 75.0
120 41.0
160 34.0
RH 190 60.0
RL 20 71.0
30 47.5
50 79.0
60 77.5
70 60.0
80 72.5
85 68.0
90 64.5
120 47.0
160 24.0
190 60.0
RM 20 50.0
30 60.0
40 50.0
45 50.0
50 55.5
70 60.0
75 76.0
90 78.0
120 48.0
160 21.0
180 21.0
190 65.0
Name: LotFrontage, dtype: float64
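Positional assignment like `Valid_median_lotFrontage_by_ZoningClass[5] = 60` depends on the row order of the grouped Series; label-based assignment on the MultiIndex is more robust. A sketch on toy data (the group labels mirror the dataset):

```python
import pandas as pd
import numpy as np

medians = pd.Series(
    [np.nan, 73.5, 75.0],
    index=pd.MultiIndex.from_tuples(
        [("RH", 190), ("FV", 20), ("FV", 60)],
        names=["MSZoning", "MSSubClass"],
    ),
    name="LotFrontage",
)

# Fill the missing group median by label instead of by position
medians.loc[("RH", 190)] = 60.0
print(medians.loc[("RH", 190)])  # 60.0
```

The label-based form keeps working even if the group order changes after a re-split of the data.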
# Apparently, the test dataset has a different set of columns with missing values than the training and validation sets.
# I put it aside for now and deal with it in a separate section.
Test_median_lotFrontage_by_ZoningClass
# ameHouse_test.isnull().sum()  # uncomment this line to show which columns have missing values in the test dataset
MSZoning MSSubClass
C (all) 20 65.0
30 66.0
50 60.0
70 60.0
190 62.5
FV 20 75.0
60 80.0
120 36.0
160 24.0
RH 20 80.0
30 70.0
50 70.0
90 60.0
120 26.0
190 59.0
RL 20 75.0
30 60.0
40 58.5
45 74.5
50 60.0
60 75.0
70 61.5
75 72.0
80 74.0
85 72.0
90 70.0
120 45.0
150 NaN
160 24.0
190 70.0
RM 20 63.5
30 56.0
45 50.0
50 56.0
60 58.5
70 60.0
75 65.0
80 76.0
85 86.0
90 68.0
120 42.5
160 21.0
180 28.0
190 60.0
Name: LotFrontage, dtype: float64
# get all rows of the validation dataset with a missing LotFrontage value
Valid_miss_lotFrontage = val_X[val_X['LotFrontage'].isnull()].index.tolist()
# Impute the values with the group medians
for j in Valid_miss_lotFrontage:
    for i in range(len(Valid_median_lotFrontage_by_ZoningClass)):
        if (val_X.loc[j,'MSZoning'] == Valid_median_lotFrontage_by_ZoningClass.index[i][0]
                and val_X.loc[j,'MSSubClass'] == Valid_median_lotFrontage_by_ZoningClass.index[i][1]):
            val_X.loc[j,'LotFrontage'] = Valid_median_lotFrontage_by_ZoningClass.iloc[i]
# Check the imputed value
val_X['LotFrontage'].isnull().sum() # No More Missing Value
0
# MasVnrType and MasVnrArea Imputation
# The reason I group these two together is that they represent the same thing: masonry veneer.
# Based on the data documentation, a house without a masonry veneer is recorded as None.
# Logically, there is no masonry area if no masonry veneer is built,
# and that's why MasVnrType and MasVnrArea have the same number of missing values
# To impute these features, I replace the missing values with the string "None" for MasVnrType and 0 for MasVnrArea
miss_MasVnr = train_X[train_X['MasVnrType'].isnull()].index.tolist()
# Imputing MasVnrType
train_X.loc[miss_MasVnr, 'MasVnrType'] = "None"
# Imputing MasVnrArea
train_X.loc[miss_MasVnr, 'MasVnrArea'] = 0.0
# check
print(train_X['MasVnrType'].isnull().sum()) # no missing value
print(train_X['MasVnrArea'].isnull().sum()) # no missing value
0
0
# Don't forget to do the same thing to validation and test dataset
# Check the missing value
print("Missing value of MasVnrType in val_X: {}".format(val_X['MasVnrType'].isnull().sum()))
print("Missing value of MasVnrArea in val_X: {}".format(val_X['MasVnrArea'].isnull().sum()))
print("Missing value of MasVnrType in ameHouse_test: {}".format(ameHouse_test['MasVnrType'].isnull().sum()))
print("Missing value of MasVnrArea in ameHouse_test: {}".format(ameHouse_test['MasVnrArea'].isnull().sum()))
Missing value of MasVnrType in val_X: 2
Missing value of MasVnrArea in val_X: 2
Missing value of MasVnrType in ameHouse_test: 16
Missing value of MasVnrArea in ameHouse_test: 15
# get the index of missing value in validation dataset
Val_miss_MasVnr = val_X[val_X['MasVnrType'].isnull()].index.tolist()
# Start imputing
val_X.loc[Val_miss_MasVnr,'MasVnrType'] = "None"
val_X.loc[Val_miss_MasVnr,'MasVnrArea'] = 0.0
# Check missing
print("Missing value of MasVnrType in val_X: {}".format(val_X['MasVnrType'].isnull().sum()))
print("Missing value of MasVnrArea in val_X: {}".format(val_X['MasVnrArea'].isnull().sum()))
Missing value of MasVnrType in val_X: 0
Missing value of MasVnrArea in val_X: 0
# Imputing the MasVnr features in the test dataset
Test_miss_MasVnr = ameHouse_test[ameHouse_test['MasVnrType'].isnull()].index.tolist()
ameHouse_test.loc[Test_miss_MasVnr, 'MasVnrType'] = "None"
ameHouse_test.loc[Test_miss_MasVnr, 'MasVnrArea'] = 0.0
print("Missing value of MasVnrType in ameHouse_test: {}".format(ameHouse_test['MasVnrType'].isnull().sum()))
print("Missing value of MasVnrArea in ameHouse_test: {}".format(ameHouse_test['MasVnrArea'].isnull().sum()))
Missing value of MasVnrType in ameHouse_test: 0
Missing value of MasVnrArea in ameHouse_test: 0
# Imputing BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2
# Same as MasVnr..., these five features describe the basement. A house without a basement is recorded as NA
# Because all of them are categorical features, I replace the missing values with the string 'NoBasement'
Train_miss_Bsmt=train_X[train_X['BsmtQual'].isnull()][['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2']].index.tolist()
train_X.loc[Train_miss_Bsmt,['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']]='NoBasement'
print("Missing value of BsmtQual:{}".format(train_X['BsmtQual'].isnull().sum()))
print("Missing value of BsmtCond:{}".format(train_X['BsmtCond'].isnull().sum()))
print("Missing value of BsmtExposure:{}".format(train_X['BsmtExposure'].isnull().sum()))
print("Missing value of BsmtFinType1:{}".format(train_X['BsmtFinType1'].isnull().sum()))
print("Missing value of BsmtFinType2:{}".format(train_X['BsmtFinType2'].isnull().sum()))
Missing value of BsmtQual:0
Missing value of BsmtCond:0
Missing value of BsmtExposure:0
Missing value of BsmtFinType1:0
Missing value of BsmtFinType2:1
train_X[train_X['BsmtFinType2'].isnull()][['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','BsmtFinSF2']]
| BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinType2 | BsmtFinSF2 | |
|---|---|---|---|---|---|---|
| 332 | Gd | TA | No | GLQ | NaN | 479 |
# I use the proportions of finished-basement type 2 below to impute the one remaining missing value of BsmtFinType2
train_X.loc[train_X['BsmtFinType2']!="Unf",'BsmtFinType2'].value_counts()
LwQ           42
Rec           39
BLQ           30
NoBasement    28
ALQ           14
GLQ           11
Name: BsmtFinType2, dtype: int64
train_X.loc[332, 'BsmtFinType2'] = "LwQ"
print("Missing value of BsmtFinType2:{}".format(train_X['BsmtFinType2'].isnull().sum()))
Missing value of BsmtFinType2:0
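Taking the single most common level is the simplest choice. Another option (an illustrative sketch, not what this notebook does) is to sample the replacement from the observed proportions, which preserves the category distribution when many values are missing:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
# Counts mirror the value_counts output above
observed = pd.Series(["LwQ"] * 42 + ["Rec"] * 39 + ["BLQ"] * 30 + ["ALQ"] * 14 + ["GLQ"] * 11)

# Draw replacements with probability proportional to the observed counts
probs = observed.value_counts(normalize=True)
fill = rng.choice(probs.index.to_numpy(), size=3, p=probs.to_numpy())
print(fill)
```

For a single missing cell the mode is fine; proportional sampling mainly matters when imputing dozens of rows at once.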
print(val_X[['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']].isnull().sum(),end="\n\n")
print(ameHouse_test[['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']].isnull().sum())
BsmtQual         9
BsmtCond         9
BsmtExposure    10
BsmtFinType1     9
BsmtFinType2     9
dtype: int64

BsmtQual        44
BsmtCond        45
BsmtExposure    44
BsmtFinType1    42
BsmtFinType2    42
dtype: int64
val_miss_bsmt = val_X[val_X['BsmtExposure'].isnull()].index.tolist()
val_X.loc[val_miss_bsmt,['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']]
| BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinType2 | |
|---|---|---|---|---|---|
| 1049 | NaN | NaN | NaN | NaN | NaN |
| 1000 | NaN | NaN | NaN | NaN | NaN |
| 362 | NaN | NaN | NaN | NaN | NaN |
| 182 | NaN | NaN | NaN | NaN | NaN |
| 897 | NaN | NaN | NaN | NaN | NaN |
| 1090 | NaN | NaN | NaN | NaN | NaN |
| 342 | NaN | NaN | NaN | NaN | NaN |
| 39 | NaN | NaN | NaN | NaN | NaN |
| 948 | Gd | TA | NaN | Unf | Unf |
| 520 | NaN | NaN | NaN | NaN | NaN |
sns.countplot(x='BsmtQual',hue='BsmtExposure',data=val_X)
plt.show()
# Based on the chart above, basements with Good quality (90-99 inches) have a higher proportion of
# Good exposure. It may come from the fact that a taller basement gets better sunlight and air circulation.
# I use this as my imputation basis for row number 948
val_X.loc[948,'BsmtExposure'] = 'Gd'
# Let's update the missing value index
val_miss_bsmt = val_X[val_X['BsmtExposure'].isnull()].index.tolist()
val_miss_bsmt
# Imputed the remaining rows
val_X.loc[val_miss_bsmt,['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']]='NoBasement'
# Check the value
val_X.loc[val_miss_bsmt,['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']]
val_X[['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']].isnull().sum() # Clear
BsmtQual        0
BsmtCond        0
BsmtExposure    0
BsmtFinType1    0
BsmtFinType2    0
dtype: int64
# Basement in Test data set
Test_miss_bsmt = ameHouse_test.loc[ameHouse_test['BsmtFinType1'].isnull()].index.tolist()
ameHouse_test.loc[Test_miss_bsmt, ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']]
| BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinType2 | |
|---|---|---|---|---|---|
| 125 | NaN | NaN | NaN | NaN | NaN |
| 133 | NaN | NaN | NaN | NaN | NaN |
| 269 | NaN | NaN | NaN | NaN | NaN |
| 318 | NaN | NaN | NaN | NaN | NaN |
| 354 | NaN | NaN | NaN | NaN | NaN |
| 387 | NaN | NaN | NaN | NaN | NaN |
| 388 | NaN | NaN | NaN | NaN | NaN |
| 396 | NaN | NaN | NaN | NaN | NaN |
| 397 | NaN | NaN | NaN | NaN | NaN |
| 398 | NaN | NaN | NaN | NaN | NaN |
| 400 | NaN | NaN | NaN | NaN | NaN |
| 455 | NaN | NaN | NaN | NaN | NaN |
| 590 | NaN | NaN | NaN | NaN | NaN |
| 606 | NaN | NaN | NaN | NaN | NaN |
| 608 | NaN | NaN | NaN | NaN | NaN |
| 660 | NaN | NaN | NaN | NaN | NaN |
| 662 | NaN | NaN | NaN | NaN | NaN |
| 728 | NaN | NaN | NaN | NaN | NaN |
| 729 | NaN | NaN | NaN | NaN | NaN |
| 730 | NaN | NaN | NaN | NaN | NaN |
| 733 | NaN | NaN | NaN | NaN | NaN |
| 756 | NaN | NaN | NaN | NaN | NaN |
| 764 | NaN | NaN | NaN | NaN | NaN |
| 927 | NaN | NaN | NaN | NaN | NaN |
| 975 | NaN | NaN | NaN | NaN | NaN |
| 992 | NaN | NaN | NaN | NaN | NaN |
| 993 | NaN | NaN | NaN | NaN | NaN |
| 1030 | NaN | NaN | NaN | NaN | NaN |
| 1038 | NaN | NaN | NaN | NaN | NaN |
| 1087 | NaN | NaN | NaN | NaN | NaN |
| 1092 | NaN | NaN | NaN | NaN | NaN |
| 1104 | NaN | NaN | NaN | NaN | NaN |
| 1118 | NaN | NaN | NaN | NaN | NaN |
| 1139 | NaN | NaN | NaN | NaN | NaN |
| 1242 | NaN | NaN | NaN | NaN | NaN |
| 1303 | NaN | NaN | NaN | NaN | NaN |
| 1306 | NaN | NaN | NaN | NaN | NaN |
| 1343 | NaN | NaN | NaN | NaN | NaN |
| 1344 | NaN | NaN | NaN | NaN | NaN |
| 1364 | NaN | NaN | NaN | NaN | NaN |
| 1431 | NaN | NaN | NaN | NaN | NaN |
| 1444 | NaN | NaN | NaN | NaN | NaN |
# Impute these rows with the same method as the train and validation datasets
ameHouse_test.loc[Test_miss_bsmt,['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']] = 'NoBasement'
ameHouse_test[['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']].isnull().sum()
BsmtQual        2
BsmtCond        3
BsmtExposure    2
BsmtFinType1    0
BsmtFinType2    0
dtype: int64
Test2_miss_bsmt = ameHouse_test[ameHouse_test['BsmtExposure'].isnull()].index.tolist()
Test_inquiry_bsmt = ameHouse_test.loc[(ameHouse_test['BsmtQual']=='Gd') & (ameHouse_test['BsmtCond']=='TA') & (ameHouse_test['BsmtFinType1']=='Unf')
& (ameHouse_test['BsmtFinType2']=='Unf')].index.tolist()
ameHouse_test.loc[Test_inquiry_bsmt,'BsmtExposure'].value_counts()
# We mine all rows from the test dataset that share the same condition
# Based on this, I impute the missing BsmtExposure with 'No'
No    119
Av     32
Mn     12
Gd      5
Name: BsmtExposure, dtype: int64
ameHouse_test.loc[Test2_miss_bsmt,'BsmtExposure'] = 'No'
ameHouse_test[['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']].isnull().sum()
BsmtQual        2
BsmtCond        3
BsmtExposure    0
BsmtFinType1    0
BsmtFinType2    0
dtype: int64
# imputing the BsmtQual in test dataset
Test3_miss_bsmt = ameHouse_test.loc[ameHouse_test['BsmtQual'].isnull()].index.tolist()
ameHouse_test.loc[Test3_miss_bsmt,['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']]
| BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinType2 | |
|---|---|---|---|---|---|
| 757 | NaN | Fa | No | Unf | Unf |
| 758 | NaN | TA | No | Unf | Unf |
# Note: the | needs its own parentheses, otherwise & binds tighter and the filter is wrong
Test4_miss_bsmt = ameHouse_test.loc[((ameHouse_test['BsmtCond']=='Fa') | (ameHouse_test['BsmtCond']=='TA'))
                                    & (ameHouse_test['BsmtExposure']=="No")
                                    & (ameHouse_test['BsmtFinType1']=="Unf")
                                    & (ameHouse_test['BsmtFinType2']=='Unf')].index.tolist()
ameHouse_test.loc[Test4_miss_bsmt,'BsmtQual'].value_counts()
TA    193
Gd    128
Fa     35
Ex      9
Name: BsmtQual, dtype: int64
# Impute the missing BsmtQual rows in the test dataset with 'TA'
# (only the rows in Test3_miss_bsmt are missing; writing to Test4_miss_bsmt would overwrite valid values)
ameHouse_test.loc[Test3_miss_bsmt,'BsmtQual'] = 'TA'
ameHouse_test[['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']].isnull().sum()
BsmtQual        0
BsmtCond        3
BsmtExposure    0
BsmtFinType1    0
BsmtFinType2    0
dtype: int64
# Imputing BsmtCond in Test dataset
Test5_miss_bsmt = ameHouse_test.loc[ameHouse_test['BsmtCond'].isnull()].index.tolist()
ameHouse_test.loc[Test5_miss_bsmt,['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']]
| BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinType2 | |
|---|---|---|---|---|---|
| 580 | Gd | NaN | Mn | GLQ | Rec |
| 725 | TA | NaN | No | BLQ | Unf |
| 1064 | TA | NaN | Av | ALQ | Unf |
# I use SimpleImputer for this, with the 'most_frequent' strategy (the categorical counterpart of the median)
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
# define the imputer
simpleImp= SimpleImputer(missing_values = np.nan, strategy='most_frequent')
# Define the ColumnTransformer; column index 30 of ameHouse_test is 'BsmtCond'
column_trans = ColumnTransformer([('imputed_BsmtCond',simpleImp,[30])])
imputed_ameHouse_test = column_trans.fit_transform(ameHouse_test)
#new data
imputed_ameHouse_test
array([['TA'],
['TA'],
['TA'],
...,
['TA'],
['TA'],
['TA']], dtype=object)
ameHouse_test['imputed_BsmtCond'] = imputed_ameHouse_test
ameHouse_test.columns
ameHouse_test.drop('BsmtCond',axis=1,inplace=True)
ameHouse_test.shape
(1459, 75)
ameHouse_test['imputed_BsmtCond']
0 TA
1 TA
2 TA
3 TA
4 TA
..
1454 TA
1455 TA
1456 TA
1457 TA
1458 TA
Name: imputed_BsmtCond, Length: 1459, dtype: object
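The ColumnTransformer route works, but it returns a bare array and forces the rename-and-drop dance above. For a single categorical column, the same "most frequent" fill can be done in place with `fillna` and `mode()`; a sketch on toy data:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"BsmtCond": ["TA", "TA", "Gd", np.nan, "Fa", np.nan]})

# Fill missing values with the most frequent level, keeping the column name intact
most_frequent = toy["BsmtCond"].mode()[0]
toy["BsmtCond"] = toy["BsmtCond"].fillna(most_frequent)
print(toy["BsmtCond"].tolist())  # ['TA', 'TA', 'Gd', 'TA', 'Fa', 'TA']
```

This keeps `BsmtCond` under its original name, so downstream code that references the column does not need to change.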
ameHouse_test.isnull().sum()
Id 0
MSSubClass 0
MSZoning 4
LotFrontage 227
LotArea 0
...
MoSold 0
YrSold 0
SaleType 1
SaleCondition 0
imputed_BsmtCond 0
Length: 75, dtype: int64
# Electrical Missing Value
miss_elec = train_X.loc[train_X['Electrical'].isnull()].index
miss_elec
Int64Index([1379], dtype='int64')
figElec,elc = plt.subplots(1,2, figsize=(15,5))
sns.countplot(ax = elc[0],x='Electrical',hue="MSZoning",data=train_X).set_title('Electrical System by MSZoning')
elc[0].legend(loc="upper right",title='MSZoning')
sns.countplot(ax=elc[1],x='Electrical',hue='MSSubClass',data=train_X).set_title('Electrical System by MSSubClass')
elc[1].legend(loc="upper right",title='MSSubClass')
plt.show()
# These two visualizations tell us which electrical system is most common within each MSZoning and MSSubClass category
train_X.loc[miss_elec,['MSZoning','MSSubClass']]
# Impute the missing value with SBrkr
train_X.loc[miss_elec,'Electrical'] = 'SBrkr'
print("Missing value of Electrical: {} ".format(train_X['Electrical'].isnull().sum()))
Missing value of Electrical: 0
# Let's do the same for validation and test dataset
print("Missing value of Electrical in validation dataset : {}".format(val_X['Electrical'].isnull().sum()))
print("Missing value of Electrical in test dataset : {}".format(ameHouse_test['Electrical'].isnull().sum()))
Missing value of Electrical in validation dataset : 0
Missing value of Electrical in test dataset : 0
# At first glance, the garage features have the same imputation properties as the basement and masonry ones. But looking
# in more detail, we realize that one feature, GarageYrBlt, cannot be imputed arbitrarily. Logically, when there is
# no garage, there was never a construction year. Imputing GarageYrBlt with any technique, as far as I know,
# would only make this feature lose its essence. Therefore, I decided to drop the rows where these features are missing.
miss_garage = train_X.loc[train_X['GarageType'].isnull()].index.tolist()
train_X.loc[miss_garage,['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond','GarageCars']]
| GarageType | GarageYrBlt | GarageFinish | GarageQual | GarageCond | GarageCars | |
|---|---|---|---|---|---|---|
| 431 | NaN | NaN | NaN | NaN | NaN | 0 |
| 638 | NaN | NaN | NaN | NaN | NaN | 0 |
| 1337 | NaN | NaN | NaN | NaN | NaN | 0 |
| 528 | NaN | NaN | NaN | NaN | NaN | 0 |
| 960 | NaN | NaN | NaN | NaN | NaN | 0 |
| 1011 | NaN | NaN | NaN | NaN | NaN | 0 |
| 1283 | NaN | NaN | NaN | NaN | NaN | 0 |
| 1038 | NaN | NaN | NaN | NaN | NaN | 0 |
| 165 | NaN | NaN | NaN | NaN | NaN | 0 |
| 1349 | NaN | NaN | NaN | NaN | NaN | 0 |
| 533 | NaN | NaN | NaN | NaN | NaN | 0 |
| 386 | NaN | NaN | NaN | NaN | NaN | 0 |
| 88 | NaN | NaN | NaN | NaN | NaN | 0 |
| 1453 | NaN | NaN | NaN | NaN | NaN | 0 |
| 1449 | NaN | NaN | NaN | NaN | NaN | 0 |
| 784 | NaN | NaN | NaN | NaN | NaN | 0 |
| 495 | NaN | NaN | NaN | NaN | NaN | 0 |
| 942 | NaN | NaN | NaN | NaN | NaN | 0 |
| 127 | NaN | NaN | NaN | NaN | NaN | 0 |
| 441 | NaN | NaN | NaN | NaN | NaN | 0 |
| 210 | NaN | NaN | NaN | NaN | NaN | 0 |
| 970 | NaN | NaN | NaN | NaN | NaN | 0 |
| 48 | NaN | NaN | NaN | NaN | NaN | 0 |
| 636 | NaN | NaN | NaN | NaN | NaN | 0 |
| 1123 | NaN | NaN | NaN | NaN | NaN | 0 |
| 1234 | NaN | NaN | NaN | NaN | NaN | 0 |
| 78 | NaN | NaN | NaN | NaN | NaN | 0 |
| 250 | NaN | NaN | NaN | NaN | NaN | 0 |
| 614 | NaN | NaN | NaN | NaN | NaN | 0 |
| 968 | NaN | NaN | NaN | NaN | NaN | 0 |
| 582 | NaN | NaN | NaN | NaN | NaN | 0 |
| 155 | NaN | NaN | NaN | NaN | NaN | 0 |
| 125 | NaN | NaN | NaN | NaN | NaN | 0 |
| 1131 | NaN | NaN | NaN | NaN | NaN | 0 |
| 1137 | NaN | NaN | NaN | NaN | NaN | 0 |
| 375 | NaN | NaN | NaN | NaN | NaN | 0 |
| 1219 | NaN | NaN | NaN | NaN | NaN | 0 |
| 635 | NaN | NaN | NaN | NaN | NaN | 0 |
| 738 | NaN | NaN | NaN | NaN | NaN | 0 |
| 1009 | NaN | NaN | NaN | NaN | NaN | 0 |
| 1179 | NaN | NaN | NaN | NaN | NaN | 0 |
| 1325 | NaN | NaN | NaN | NaN | NaN | 0 |
| 163 | NaN | NaN | NaN | NaN | NaN | 0 |
| 1407 | NaN | NaN | NaN | NaN | NaN | 0 |
| 750 | NaN | NaN | NaN | NaN | NaN | 0 |
| 1218 | NaN | NaN | NaN | NaN | NaN | 0 |
| 710 | NaN | NaN | NaN | NaN | NaN | 0 |
| 1326 | NaN | NaN | NaN | NaN | NaN | 0 |
| 287 | NaN | NaN | NaN | NaN | NaN | 0 |
| 843 | NaN | NaN | NaN | NaN | NaN | 0 |
| 307 | NaN | NaN | NaN | NaN | NaN | 0 |
| 562 | NaN | NaN | NaN | NaN | NaN | 0 |
| 954 | NaN | NaN | NaN | NaN | NaN | 0 |
| 1143 | NaN | NaN | NaN | NaN | NaN | 0 |
| 291 | NaN | NaN | NaN | NaN | NaN | 0 |
| 99 | NaN | NaN | NaN | NaN | NaN | 0 |
| 976 | NaN | NaN | NaN | NaN | NaN | 0 |
| 705 | NaN | NaN | NaN | NaN | NaN | 0 |
train_X.drop(index=miss_garage, inplace=True)
train_X[['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond','GarageCars']].isnull().sum()
GarageType      0
GarageYrBlt     0
GarageFinish    0
GarageQual      0
GarageCond      0
GarageCars      0
dtype: int64
# Check all features
train_X[miss_col].isnull().sum()
# Check the data dimension
train_X.shape  # From 1168 to 1110 rows; I lose 58 rows but still retain 95% of the data
(1110, 75)
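Dropping the 58 rows is one defensible choice. An alternative that keeps them (a sketch of a common pattern, not what this notebook does) encodes "no garage" explicitly: a sentinel 0 for GarageYrBlt plus an indicator column, so a model can still distinguish the two cases:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    "GarageType":  ["Attchd", np.nan, "Detchd"],
    "GarageYrBlt": [2003.0, np.nan, 1998.0],
})

# Flag houses without a garage, then fill sentinel values
toy["HasGarage"] = toy["GarageType"].notnull().astype(int)
toy["GarageType"] = toy["GarageType"].fillna("NoGarage")
toy["GarageYrBlt"] = toy["GarageYrBlt"].fillna(0)
print(toy)
```

The trade-off: the sentinel year 0 is meaningless as a number, which is exactly the "lost essence" concern above, but the indicator column lets tree-based models route around it.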
# Garage Features in validation and test dataset
print(val_X[['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond']].isnull().sum(),end="\n\n")
print(ameHouse_test[['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond']].isnull().sum())
GarageType      23
GarageYrBlt     23
GarageFinish    23
GarageQual      23
GarageCond      23
dtype: int64

GarageType      76
GarageYrBlt     78
GarageFinish    78
GarageQual      78
GarageCond      78
dtype: int64
Valid_miss_garage = val_X.loc[val_X['GarageType'].isnull()].index.tolist()
val_X.loc[Valid_miss_garage,['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond']]
| GarageType | GarageYrBlt | GarageFinish | GarageQual | GarageCond | |
|---|---|---|---|---|---|
| 89 | NaN | NaN | NaN | NaN | NaN |
| 649 | NaN | NaN | NaN | NaN | NaN |
| 393 | NaN | NaN | NaN | NaN | NaN |
| 535 | NaN | NaN | NaN | NaN | NaN |
| 1173 | NaN | NaN | NaN | NaN | NaN |
| 1096 | NaN | NaN | NaN | NaN | NaN |
| 148 | NaN | NaN | NaN | NaN | NaN |
| 921 | NaN | NaN | NaN | NaN | NaN |
| 613 | NaN | NaN | NaN | NaN | NaN |
| 108 | NaN | NaN | NaN | NaN | NaN |
| 464 | NaN | NaN | NaN | NaN | NaN |
| 1030 | NaN | NaN | NaN | NaN | NaN |
| 620 | NaN | NaN | NaN | NaN | NaN |
| 826 | NaN | NaN | NaN | NaN | NaN |
| 434 | NaN | NaN | NaN | NaN | NaN |
| 140 | NaN | NaN | NaN | NaN | NaN |
| 1450 | NaN | NaN | NaN | NaN | NaN |
| 39 | NaN | NaN | NaN | NaN | NaN |
| 1257 | NaN | NaN | NaN | NaN | NaN |
| 241 | NaN | NaN | NaN | NaN | NaN |
| 520 | NaN | NaN | NaN | NaN | NaN |
| 198 | NaN | NaN | NaN | NaN | NaN |
| 1323 | NaN | NaN | NaN | NaN | NaN |
# drop all rows with missing garage features
val_X.drop(index=Valid_miss_garage, inplace=True)
val_X[['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond','GarageCars']].isnull().sum()
GarageType      0
GarageYrBlt     0
GarageFinish    0
GarageQual      0
GarageCond      0
GarageCars      0
dtype: int64
val_X.shape  # I am able to keep 92.12% of the data after dropping these rows
(269, 75)
Test_miss_garage = ameHouse_test.loc[ameHouse_test['GarageFinish'].isnull()].index.tolist()
print(ameHouse_test.loc[Test_miss_garage,['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond']])
     GarageType GarageYrBlt GarageFinish GarageQual GarageCond
53          NaN         NaN          NaN        NaN        NaN
71          NaN         NaN          NaN        NaN        NaN
79          NaN         NaN          NaN        NaN        NaN
92          NaN         NaN          NaN        NaN        NaN
96          NaN         NaN          NaN        NaN        NaN
...         ...         ...          ...        ...        ...
1433        NaN         NaN          NaN        NaN        NaN
1449        NaN         NaN          NaN        NaN        NaN
1453        NaN         NaN          NaN        NaN        NaN
1454        NaN         NaN          NaN        NaN        NaN
1457        NaN         NaN          NaN        NaN        NaN

[78 rows x 5 columns]
ameHouse_test.loc[(ameHouse_test['GarageType'].notnull()) & (ameHouse_test['GarageFinish'].isnull())]
ameHouse_test.loc[[666,1116],['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond']]
| GarageType | GarageYrBlt | GarageFinish | GarageQual | GarageCond | |
|---|---|---|---|---|---|
| 666 | Detchd | NaN | NaN | NaN | NaN |
| 1116 | Detchd | NaN | NaN | NaN | NaN |
# There is a little confusion in the test dataset: of the 78 rows, only two (666 and 1116) have a GarageType recorded.
# Since every test row needs a prediction, I cannot actually drop rows here, so I keep all 78 and impute the garage
# features later along with the other remaining test-set gaps.
#ameHouse_test.drop(index=Test_miss_garage, inplace=True)
ameHouse_test[['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond','GarageCars']].isnull().sum()
GarageType      76
GarageYrBlt     78
GarageFinish    78
GarageQual      78
GarageCond      78
GarageCars       1
dtype: int64
ameHouse_test.shape  # all 1459 test rows are retained
(1459, 75)
# Check how many columns still have missing values in the train, validation, and test data
columns_miss_train = len([col for col in train_X.columns if train_X[col].isnull().sum()>0])
columns_miss_val = len([col for col in val_X.columns if val_X[col].isnull().sum()>0])
columns_miss_test = len([col for col in ameHouse_test.columns if ameHouse_test[col].isnull().sum()>0])
print(columns_miss_train)
print(columns_miss_val)
print(columns_miss_test)
0
0
21
# Check which columns have missing values in the test dataset
miss_test_col = [col for col in ameHouse_test.columns if ameHouse_test[col].isnull().sum()>0]
ameHouse_test[miss_test_col].isnull().sum()
MSZoning          4
LotFrontage     227
Utilities         2
Exterior1st       1
Exterior2nd       1
BsmtFinSF1        1
BsmtFinSF2        1
BsmtUnfSF         1
TotalBsmtSF       1
BsmtFullBath      2
BsmtHalfBath      2
KitchenQual       1
Functional        2
GarageType       76
GarageYrBlt      78
GarageFinish     78
GarageCars        1
GarageArea        1
GarageQual       78
GarageCond       78
SaleType          1
dtype: int64
# For efficiency, I use KNN imputation for numerical features and SimpleImputer with "most_frequent" for strings
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.compose import ColumnTransformer
KnnImp = KNNImputer(n_neighbors = 6)
ames_test_cat =ameHouse_test.select_dtypes("object").columns.tolist()
ames_test_num =ameHouse_test.select_dtypes(["int64","float64"]).columns.tolist()
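As a quick illustration of what `KNNImputer` does (a toy example, not part of the pipeline): each missing entry is replaced by the mean of that feature over the `n_neighbors` closest rows, with distances computed on the coordinates both rows actually have.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix: row 2 is missing its first feature
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0],
              [5.0, 8.0]])

# The two nearest rows (judged by the second feature) are [3, 4] and [5, 8],
# so the NaN is filled with mean(3, 5) = 4.0
filled = KNNImputer(n_neighbors=2).fit_transform(X)
```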
# ColumnTransformer applying each imputer to its own column group
catTestImputer = SimpleImputer(strategy = 'most_frequent')
test_transformer = ColumnTransformer([('imputed_cat',catTestImputer,ames_test_cat),
('imputed_num',KnnImp,ames_test_num)])
test_imputed_ames = test_transformer.fit_transform(ameHouse_test)
new_test_col = ameHouse_test.columns
new_test_col
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtExposure',
'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF',
'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical',
'1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath',
'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr',
'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'GarageType',
'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
'MoSold', 'YrSold', 'SaleType', 'SaleCondition', 'imputed_BsmtCond'],
dtype='object')
# ColumnTransformer outputs columns in transformer order (categorical first, then numeric),
# so rebuild the DataFrame with that order, not the original column order
new_col = ames_test_cat + ames_test_num
imputed_ames_test_new = pd.DataFrame(test_imputed_ames, columns=new_col).reset_index()
imputed_ames_test_new.drop('Id',axis=1,inplace=True)
imputed_ames_test_new
| index | MSZoning | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | RH | Pave | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Feedr | ... | 730.0 | 140.0 | 0.0 | 0.0 | 0.0 | 120.0 | 0.0 | 0.0 | 6.0 | 2010.0 |
| 1 | 1 | RL | Pave | IR1 | Lvl | AllPub | Corner | Gtl | NAmes | Norm | ... | 312.0 | 393.0 | 36.0 | 0.0 | 0.0 | 0.0 | 0.0 | 12500.0 | 6.0 | 2010.0 |
| 2 | 2 | RL | Pave | IR1 | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | ... | 482.0 | 212.0 | 34.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 2010.0 |
| 3 | 3 | RL | Pave | IR1 | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | ... | 470.0 | 360.0 | 36.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2010.0 |
| 4 | 4 | RL | Pave | IR1 | HLS | AllPub | Inside | Gtl | StoneBr | Norm | ... | 506.0 | 0.0 | 82.0 | 0.0 | 0.0 | 144.0 | 0.0 | 0.0 | 1.0 | 2010.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1454 | 1454 | RM | Pave | Reg | Lvl | AllPub | Inside | Gtl | MeadowV | Norm | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2006.0 |
| 1455 | 1455 | RM | Pave | Reg | Lvl | AllPub | Inside | Gtl | MeadowV | Norm | ... | 286.0 | 0.0 | 24.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 2006.0 |
| 1456 | 1456 | RL | Pave | Reg | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | ... | 576.0 | 474.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | 2006.0 |
| 1457 | 1457 | RL | Pave | Reg | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | ... | 0.0 | 80.0 | 32.0 | 0.0 | 0.0 | 0.0 | 0.0 | 700.0 | 7.0 | 2006.0 |
| 1458 | 1458 | RL | Pave | Reg | Lvl | AllPub | Inside | Mod | Mitchel | Norm | ... | 650.0 | 190.0 | 48.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 2006.0 |
1459 rows × 75 columns
# Check for missing value
print("Missing value in test dataset: {}".format(len([col for col in imputed_ames_test_new.columns if imputed_ames_test_new[col].isnull().sum()>0])))
print("Missing value in train dataset: {}".format(len([col for col in train_X.columns if train_X[col].isnull().sum()>0])))
print("Missing value in validation dataset: {}".format(len([col for col in val_X.columns if val_X[col].isnull().sum()>0])))
Missing value in test dataset: 0 Missing value in train dataset: 0 Missing value in validation dataset: 0
Now that we are done with missing values in all datasets, we proceed to the next step: more cleaning. Here is the plan for the next few steps.
After that, I will continue with FEATURE ENGINEERING, MODEL BUILDING, MODEL ENSEMBLING, and finally draw one or more conclusions from the data.
Typically, there are two methods for handling categorical/ordinal features (three, actually, but I dismiss the third because it amounts to simply dropping all non-numerical features): ordinal encoding and one-hot encoding.
The difference between the two lies in the encoded output. Ordinal encoding converts categorical data into ordinal data (a single column of integer codes), while one-hot encoding converts categorical data into nominal data, where each category gets its own indicator column.
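The contrast is easiest to see on a toy column (illustrative only; `Qual` is a made-up feature):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

toy = pd.DataFrame({'Qual': ['Fa', 'TA', 'Gd', 'TA']})

# Ordinal encoding: one column of integer codes
# (categories are sorted alphabetically by default: Fa=0, Gd=1, TA=2)
codes = OrdinalEncoder().fit_transform(toy)

# One-hot encoding: one indicator column per category (Fa, Gd, TA)
dummies = OneHotEncoder().fit_transform(toy).toarray()
```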
print("Object-type data column in train_X :{}".format(len(train_X.select_dtypes('object').columns))) # 38
print(train_X.select_dtypes('object').nunique())
ames_cat_col = train_X.select_dtypes('object').columns.tolist()
Object-type data column in train_X :38 MSZoning 5 Street 2 LotShape 4 LandContour 4 Utilities 2 LotConfig 5 LandSlope 3 Neighborhood 25 Condition1 9 Condition2 6 BldgType 5 HouseStyle 8 RoofStyle 6 RoofMatl 7 Exterior1st 14 Exterior2nd 15 MasVnrType 4 ExterQual 4 ExterCond 4 Foundation 6 BsmtQual 5 BsmtCond 5 BsmtExposure 5 BsmtFinType1 7 BsmtFinType2 7 Heating 6 HeatingQC 5 CentralAir 2 Electrical 5 KitchenQual 4 Functional 6 GarageType 6 GarageFinish 3 GarageQual 5 GarageCond 5 PavedDrive 3 SaleType 8 SaleCondition 6 dtype: int64
# Another thing to know, especially for the Ames, Iowa housing data, is that some numerical features can be
# treated as categorical data, or more specifically as ordinal data.
# To choose the right features to treat as ordinal: first, filter the ones with low cardinality,
# and second, read the data documentation to make sure they indeed have an ordinal nature.
low_num_card = []
count_num = train_X.select_dtypes(['int64','float64']).nunique()
for i in count_num.index:
if count_num[i] <= 15:
low_num_card.append(i)
low_num_card
['MSSubClass', 'OverallQual', 'OverallCond', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'PoolArea', 'MoSold', 'YrSold']
There are 15 numerical features with low cardinality. After double-checking against the data documentation, 11 of them (MSSubClass through Fireplaces) can be treated as ordinal/nominal features, while GarageCars, PoolArea, MoSold, and YrSold stay numeric. Hence, I will take those 11 out of the numeric feature group below.
ames_cat_col = ames_cat_col + low_num_card[:11]
ames_cat_col
['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'SaleType', 'SaleCondition', 'MSSubClass', 'OverallQual', 'OverallCond', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces']
# From the categorical/ordinal group of features, I need to decide which ones get Ordinal and which get One-Hot
# encoding, based on their nature.
# Usually I would read the data documentation to understand what each feature represents,
# but there are too many features, so let's take the most efficient strategy:
# almost every feature with string values gets One-Hot encoding,
# with the exception of 'ExterQual', 'ExterCond', 'BsmtCond', 'BsmtQual', 'BsmtExposure', 'BsmtFinType1',
# 'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'GarageQual', and 'GarageCond', which get Ordinal encoding.
# The reason I encode those features ordinally is that they inherently represent quality levels,
# and presumably those levels are better represented as ordinal data.
# Ordinal Encoding
ame_cat_to_ordinal = ['ExterQual','ExterCond', 'BsmtCond', 'BsmtQual','BsmtExposure','BsmtFinType1','BsmtFinType2','HeatingQC',
'KitchenQual','GarageQual','GarageCond']+ames_cat_col[38:]
# create a copy of these columns in a separate dataframe
label_ordinal_train_X = train_X[ame_cat_to_ordinal].copy(deep=True)
label_ordinal_val_X = val_X[ame_cat_to_ordinal].copy(deep=True)
# for test dataset
imputed_ames_test_new.rename(columns={'index':'Id','imputed_BsmtCond':'BsmtCond'},inplace=True)
label_ordinal_test = imputed_ames_test_new[ame_cat_to_ordinal].copy(deep=True)
label_ordinal_test.nunique()
ExterQual 4 ExterCond 5 BsmtCond 5 BsmtQual 5 BsmtExposure 5 BsmtFinType1 7 BsmtFinType2 7 HeatingQC 5 KitchenQual 4 GarageQual 4 GarageCond 5 MSSubClass 16 OverallQual 10 OverallCond 9 BsmtFullBath 5 BsmtHalfBath 3 FullBath 5 HalfBath 3 BedroomAbvGr 7 KitchenAbvGr 3 TotRmsAbvGrd 12 Fireplaces 5 dtype: int64
label_ordinal_val_X.nunique()
ExterQual 4 ExterCond 3 BsmtCond 4 BsmtQual 5 BsmtExposure 5 BsmtFinType1 7 BsmtFinType2 7 HeatingQC 4 KitchenQual 4 GarageQual 4 GarageCond 5 MSSubClass 14 OverallQual 8 OverallCond 7 BsmtFullBath 3 BsmtHalfBath 3 FullBath 4 HalfBath 3 BedroomAbvGr 7 KitchenAbvGr 2 TotRmsAbvGrd 9 Fireplaces 4 dtype: int64
# import OrdinalEncoder module
from sklearn.preprocessing import OrdinalEncoder
# note: unknown_value must stay outside the range of learned codes
# (MSSubClass alone has up to 16 levels, so a value like -1 would avoid any collision)
ordinal_encoder = OrdinalEncoder(handle_unknown='use_encoded_value',unknown_value=15)
label_ordinal_train_X[:] = ordinal_encoder.fit_transform(label_ordinal_train_X)
label_ordinal_val_X[:] = ordinal_encoder.transform(label_ordinal_val_X)
# For test dataset
label_ordinal_test[:] = ordinal_encoder.transform(label_ordinal_test)
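One caveat worth noting: by default OrdinalEncoder assigns codes in alphabetical order, so the codes for the quality columns do not follow the actual quality scale (for example, 'Ex' sorts before 'Fa'). A possible refinement, sketched here on a hypothetical single column, would pass the documented order Po < Fa < TA < Gd < Ex explicitly:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Quality scale from the Ames data dictionary, worst to best
qual_scale = ['Po', 'Fa', 'TA', 'Gd', 'Ex']

enc = OrdinalEncoder(categories=[qual_scale])
codes = enc.fit_transform(pd.DataFrame({'KitchenQual': ['TA', 'Ex', 'Gd']}))
# codes now respect the scale: TA=2, Ex=4, Gd=3
```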
label_ordinal_train_X
label_ordinal_val_X
label_ordinal_test
| ExterQual | ExterCond | BsmtCond | BsmtQual | BsmtExposure | BsmtFinType1 | BsmtFinType2 | HeatingQC | KitchenQual | GarageQual | ... | OverallQual | OverallCond | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | TotRmsAbvGrd | Fireplaces | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 3.0 | 4.0 | 4.0 | 3.0 | 5.0 | 3.0 | 4.0 | 3.0 | 4.0 | ... | 3.0 | 4.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 | 2.0 | 0.0 |
| 1 | 3.0 | 3.0 | 4.0 | 4.0 | 3.0 | 0.0 | 6.0 | 4.0 | 2.0 | 4.0 | ... | 4.0 | 4.0 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 | 0.0 | 3.0 | 0.0 |
| 2 | 3.0 | 3.0 | 4.0 | 2.0 | 3.0 | 2.0 | 6.0 | 2.0 | 3.0 | 4.0 | ... | 3.0 | 3.0 | 0.0 | 0.0 | 2.0 | 1.0 | 3.0 | 0.0 | 3.0 | 1.0 |
| 3 | 3.0 | 3.0 | 4.0 | 4.0 | 3.0 | 2.0 | 6.0 | 0.0 | 2.0 | 4.0 | ... | 4.0 | 4.0 | 0.0 | 0.0 | 2.0 | 1.0 | 3.0 | 0.0 | 4.0 | 1.0 |
| 4 | 2.0 | 3.0 | 4.0 | 2.0 | 3.0 | 0.0 | 6.0 | 0.0 | 2.0 | 4.0 | ... | 6.0 | 3.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | 0.0 | 2.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1454 | 3.0 | 3.0 | 4.0 | 4.0 | 3.0 | 6.0 | 6.0 | 2.0 | 3.0 | 4.0 | ... | 2.0 | 5.0 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 | 0.0 | 2.0 | 0.0 |
| 1455 | 3.0 | 3.0 | 4.0 | 4.0 | 3.0 | 5.0 | 6.0 | 4.0 | 3.0 | 4.0 | ... | 2.0 | 3.0 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 | 0.0 | 3.0 | 0.0 |
| 1456 | 3.0 | 3.0 | 4.0 | 4.0 | 3.0 | 0.0 | 6.0 | 0.0 | 3.0 | 4.0 | ... | 3.0 | 5.0 | 1.0 | 0.0 | 1.0 | 0.0 | 4.0 | 0.0 | 4.0 | 1.0 |
| 1457 | 3.0 | 3.0 | 4.0 | 2.0 | 0.0 | 2.0 | 6.0 | 4.0 | 3.0 | 4.0 | ... | 3.0 | 3.0 | 0.0 | 1.0 | 1.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 |
| 1458 | 3.0 | 3.0 | 4.0 | 2.0 | 0.0 | 3.0 | 6.0 | 0.0 | 3.0 | 4.0 | ... | 5.0 | 3.0 | 0.0 | 0.0 | 2.0 | 1.0 | 3.0 | 0.0 | 6.0 | 1.0 |
1459 rows × 22 columns
ame_cat_to_onehot = [col for col in ames_cat_col if col not in ame_cat_to_ordinal]
label_onehot_train_X = train_X[ame_cat_to_onehot].copy(deep=True)
label_onehot_val_X = val_X[ame_cat_to_onehot].copy(deep=True)
# For test dataset
label_onehot_test = imputed_ames_test_new[ame_cat_to_onehot].copy(deep=True)
label_onehot_test
| MSZoning | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | ... | Foundation | Heating | CentralAir | Electrical | Functional | GarageType | GarageFinish | PavedDrive | SaleType | SaleCondition | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RH | Pave | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Feedr | Norm | ... | CBlock | GasA | Y | SBrkr | Typ | Attchd | Unf | Y | WD | Normal |
| 1 | RL | Pave | IR1 | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | ... | CBlock | GasA | Y | SBrkr | Typ | Attchd | Unf | Y | WD | Normal |
| 2 | RL | Pave | IR1 | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | ... | PConc | GasA | Y | SBrkr | Typ | Attchd | Fin | Y | WD | Normal |
| 3 | RL | Pave | IR1 | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | ... | PConc | GasA | Y | SBrkr | Typ | Attchd | Fin | Y | WD | Normal |
| 4 | RL | Pave | IR1 | HLS | AllPub | Inside | Gtl | StoneBr | Norm | Norm | ... | PConc | GasA | Y | SBrkr | Typ | Attchd | RFn | Y | WD | Normal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1454 | RM | Pave | Reg | Lvl | AllPub | Inside | Gtl | MeadowV | Norm | Norm | ... | CBlock | GasA | Y | SBrkr | Typ | Attchd | Unf | Y | WD | Normal |
| 1455 | RM | Pave | Reg | Lvl | AllPub | Inside | Gtl | MeadowV | Norm | Norm | ... | CBlock | GasA | Y | SBrkr | Typ | CarPort | Unf | Y | WD | Abnorml |
| 1456 | RL | Pave | Reg | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | Norm | ... | CBlock | GasA | Y | SBrkr | Typ | Detchd | Unf | Y | WD | Abnorml |
| 1457 | RL | Pave | Reg | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | Norm | ... | PConc | GasA | Y | SBrkr | Typ | Attchd | Unf | Y | WD | Normal |
| 1458 | RL | Pave | Reg | Lvl | AllPub | Inside | Mod | Mitchel | Norm | Norm | ... | PConc | GasA | Y | SBrkr | Typ | Attchd | Fin | Y | WD | Normal |
1459 rows × 27 columns
label_onehot_train_X
| MSZoning | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | ... | Foundation | Heating | CentralAir | Electrical | Functional | GarageType | GarageFinish | PavedDrive | SaleType | SaleCondition | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 618 | RL | Pave | Reg | Lvl | AllPub | Inside | Gtl | NridgHt | Norm | Norm | ... | PConc | GasA | Y | SBrkr | Typ | Attchd | Unf | Y | New | Partial |
| 870 | RL | Pave | Reg | Lvl | AllPub | Inside | Gtl | NAmes | PosN | Norm | ... | CBlock | GasA | N | SBrkr | Typ | Detchd | Unf | Y | WD | Normal |
| 92 | RL | Pave | IR1 | HLS | AllPub | Inside | Gtl | Crawfor | Norm | Norm | ... | BrkTil | GasA | Y | SBrkr | Typ | Detchd | Unf | Y | WD | Normal |
| 817 | RL | Pave | IR1 | Lvl | AllPub | CulDSac | Gtl | Mitchel | Norm | Norm | ... | PConc | GasA | Y | SBrkr | Typ | Attchd | RFn | Y | WD | Normal |
| 302 | RL | Pave | IR1 | Lvl | AllPub | Corner | Gtl | CollgCr | Norm | Norm | ... | PConc | GasA | Y | SBrkr | Typ | Attchd | RFn | Y | WD | Normal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | RL | Pave | Reg | Lvl | AllPub | Inside | Gtl | NoRidge | Norm | Norm | ... | PConc | GasA | Y | SBrkr | Typ | Attchd | RFn | Y | WD | Normal |
| 835 | RL | Pave | Reg | Lvl | AllPub | Inside | Gtl | Sawyer | Norm | Norm | ... | CBlock | GasA | Y | SBrkr | Min2 | Attchd | Unf | Y | WD | Normal |
| 1216 | RM | Pave | Reg | Lvl | AllPub | Inside | Gtl | Sawyer | RRAe | Norm | ... | Slab | GasA | Y | SBrkr | Typ | Attchd | Unf | Y | WD | Normal |
| 559 | RL | Pave | Reg | Lvl | AllPub | Inside | Gtl | Blmngtn | Norm | Norm | ... | PConc | GasA | Y | SBrkr | Typ | Attchd | Fin | Y | WD | Normal |
| 684 | RL | Pave | IR2 | Lvl | AllPub | CulDSac | Gtl | NoRidge | Norm | Norm | ... | PConc | GasA | Y | SBrkr | Typ | Attchd | Fin | Y | WD | Normal |
1110 rows × 27 columns
# One Hot Encoder
from sklearn.preprocessing import OneHotEncoder
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)  # 'sparse' was renamed 'sparse_output' in scikit-learn 1.2
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(label_onehot_train_X[:]))
OH_cols_val = pd.DataFrame(OH_encoder.transform(label_onehot_val_X[:]))
# For test dataset
OH_cols_test = pd.DataFrame(OH_encoder.transform(label_onehot_test[:]))
OH_cols_test.index = label_onehot_test.index
OH_cols_test
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1454 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1455 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1456 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1457 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1458 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1459 rows × 175 columns
OH_cols_train.index = label_onehot_train_X.index
OH_cols_val.index = label_onehot_val_X.index
OH_col_names = OH_encoder.get_feature_names().tolist()  # deprecated since scikit-learn 1.0; get_feature_names_out() is the replacement
# Change the columns name
OH_cols_train.columns = OH_col_names
OH_cols_val.columns = OH_col_names
OH_cols_test.columns = OH_col_names
OH_cols_test
| x0_C (all) | x0_FV | x0_RH | x0_RL | x0_RM | x1_Grvl | x1_Pave | x2_IR1 | x2_IR2 | x2_IR3 | ... | x25_ConLI | x25_ConLw | x25_New | x25_WD | x26_Abnorml | x26_AdjLand | x26_Alloca | x26_Family | x26_Normal | x26_Partial | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1454 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1455 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1456 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1457 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1458 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1459 rows × 175 columns
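The `handle_unknown='ignore'` setting is what guarantees train, validation, and test all end up with the same 175 one-hot columns: a category the encoder never saw during fit simply encodes as an all-zero row. A minimal demonstration on toy data:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(pd.DataFrame({'Street': ['Grvl', 'Pave']}))

# 'Dirt' was never seen at fit time: instead of raising an error,
# it becomes a row of zeros across the two learned columns
unseen = enc.transform(pd.DataFrame({'Street': ['Dirt']})).toarray()
```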
# join the two encodings into one dataframe of transformed categorical features
ames_transform_all_cat_train = pd.concat([label_ordinal_train_X,OH_cols_train],axis=1)
ames_transform_all_cat_val = pd.concat([label_ordinal_val_X,OH_cols_val],axis=1)
# For test dataset
ames_transform_all_cat_test = pd.concat([label_ordinal_test,OH_cols_test],axis=1)
ames_transform_all_cat_test
| ExterQual | ExterCond | BsmtCond | BsmtQual | BsmtExposure | BsmtFinType1 | BsmtFinType2 | HeatingQC | KitchenQual | GarageQual | ... | x25_ConLI | x25_ConLw | x25_New | x25_WD | x26_Abnorml | x26_AdjLand | x26_Alloca | x26_Family | x26_Normal | x26_Partial | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 3.0 | 4.0 | 4.0 | 3.0 | 5.0 | 3.0 | 4.0 | 3.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 3.0 | 3.0 | 4.0 | 4.0 | 3.0 | 0.0 | 6.0 | 4.0 | 2.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 3.0 | 3.0 | 4.0 | 2.0 | 3.0 | 2.0 | 6.0 | 2.0 | 3.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 3.0 | 3.0 | 4.0 | 4.0 | 3.0 | 2.0 | 6.0 | 0.0 | 2.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 2.0 | 3.0 | 4.0 | 2.0 | 3.0 | 0.0 | 6.0 | 0.0 | 2.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1454 | 3.0 | 3.0 | 4.0 | 4.0 | 3.0 | 6.0 | 6.0 | 2.0 | 3.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1455 | 3.0 | 3.0 | 4.0 | 4.0 | 3.0 | 5.0 | 6.0 | 4.0 | 3.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1456 | 3.0 | 3.0 | 4.0 | 4.0 | 3.0 | 0.0 | 6.0 | 0.0 | 3.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1457 | 3.0 | 3.0 | 4.0 | 2.0 | 0.0 | 2.0 | 6.0 | 4.0 | 3.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1458 | 3.0 | 3.0 | 4.0 | 2.0 | 0.0 | 3.0 | 6.0 | 0.0 | 3.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1459 rows × 197 columns
ames_transform_all_cat_train.head()
| ExterQual | ExterCond | BsmtCond | BsmtQual | BsmtExposure | BsmtFinType1 | BsmtFinType2 | HeatingQC | KitchenQual | GarageQual | ... | x25_ConLI | x25_ConLw | x25_New | x25_WD | x26_Abnorml | x26_AdjLand | x26_Alloca | x26_Family | x26_Normal | x26_Partial | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 618 | 0.0 | 3.0 | 4.0 | 0.0 | 0.0 | 2.0 | 6.0 | 0.0 | 2.0 | 4.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 870 | 3.0 | 3.0 | 4.0 | 4.0 | 3.0 | 6.0 | 6.0 | 2.0 | 3.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 92 | 3.0 | 2.0 | 4.0 | 2.0 | 3.0 | 0.0 | 6.0 | 0.0 | 3.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 817 | 2.0 | 3.0 | 4.0 | 2.0 | 3.0 | 2.0 | 6.0 | 0.0 | 2.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 302 | 2.0 | 3.0 | 4.0 | 2.0 | 3.0 | 6.0 | 6.0 | 0.0 | 2.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
5 rows × 197 columns
ames_transform_all_cat_val.head()
| ExterQual | ExterCond | BsmtCond | BsmtQual | BsmtExposure | BsmtFinType1 | BsmtFinType2 | HeatingQC | KitchenQual | GarageQual | ... | x25_ConLI | x25_ConLw | x25_New | x25_WD | x26_Abnorml | x26_AdjLand | x26_Alloca | x26_Family | x26_Normal | x26_Partial | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 529 | 2.0 | 3.0 | 4.0 | 4.0 | 3.0 | 5.0 | 6.0 | 4.0 | 3.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 491 | 3.0 | 3.0 | 4.0 | 4.0 | 3.0 | 1.0 | 5.0 | 4.0 | 1.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 459 | 3.0 | 3.0 | 4.0 | 4.0 | 3.0 | 3.0 | 6.0 | 4.0 | 2.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 279 | 3.0 | 3.0 | 4.0 | 2.0 | 3.0 | 1.0 | 6.0 | 0.0 | 3.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 655 | 3.0 | 3.0 | 4.0 | 4.0 | 3.0 | 6.0 | 6.0 | 4.0 | 3.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
5 rows × 197 columns
# don't forget to drop all categorical/ordinal features from the original dataset
num_train_X = train_X.drop(ames_cat_col, axis = 1)
num_val_X = val_X.drop(ames_cat_col, axis = 1)
# for test dataset
num_test = imputed_ames_test_new.drop(ames_cat_col,axis=1)
num_test
| Id | LotFrontage | LotArea | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 80.0 | 11622.0 | 1961.0 | 1961.0 | 0.0 | 468.0 | 144.0 | 270.0 | 882.0 | ... | 730.0 | 140.0 | 0.0 | 0.0 | 0.0 | 120.0 | 0.0 | 0.0 | 6.0 | 2010.0 |
| 1 | 1 | 81.0 | 14267.0 | 1958.0 | 1958.0 | 108.0 | 923.0 | 0.0 | 406.0 | 1329.0 | ... | 312.0 | 393.0 | 36.0 | 0.0 | 0.0 | 0.0 | 0.0 | 12500.0 | 6.0 | 2010.0 |
| 2 | 2 | 74.0 | 13830.0 | 1997.0 | 1998.0 | 0.0 | 791.0 | 0.0 | 137.0 | 928.0 | ... | 482.0 | 212.0 | 34.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 2010.0 |
| 3 | 3 | 78.0 | 9978.0 | 1998.0 | 1998.0 | 20.0 | 602.0 | 0.0 | 324.0 | 926.0 | ... | 470.0 | 360.0 | 36.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2010.0 |
| 4 | 4 | 43.0 | 5005.0 | 1992.0 | 1992.0 | 0.0 | 263.0 | 0.0 | 1017.0 | 1280.0 | ... | 506.0 | 0.0 | 82.0 | 0.0 | 0.0 | 144.0 | 0.0 | 0.0 | 1.0 | 2010.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1454 | 1454 | 21.0 | 1936.0 | 1970.0 | 1970.0 | 0.0 | 0.0 | 0.0 | 546.0 | 546.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2006.0 |
| 1455 | 1455 | 21.0 | 1894.0 | 1970.0 | 1970.0 | 0.0 | 252.0 | 0.0 | 294.0 | 546.0 | ... | 286.0 | 0.0 | 24.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 2006.0 |
| 1456 | 1456 | 160.0 | 20000.0 | 1960.0 | 1996.0 | 0.0 | 1224.0 | 0.0 | 0.0 | 1224.0 | ... | 576.0 | 474.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | 2006.0 |
| 1457 | 1457 | 62.0 | 10441.0 | 1992.0 | 1992.0 | 0.0 | 337.0 | 0.0 | 575.0 | 912.0 | ... | 0.0 | 80.0 | 32.0 | 0.0 | 0.0 | 0.0 | 0.0 | 700.0 | 7.0 | 2006.0 |
| 1458 | 1458 | 74.0 | 9627.0 | 1993.0 | 1994.0 | 94.0 | 758.0 | 0.0 | 238.0 | 996.0 | ... | 650.0 | 190.0 | 48.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 2006.0 |
1459 rows × 26 columns
num_train_X.head()
| Id | LotFrontage | LotArea | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 618 | 619 | 90.0 | 11694 | 2007 | 2007 | 452.0 | 48 | 0 | 1774 | 1822 | ... | 774 | 0 | 108 | 0 | 0 | 260 | 0 | 0 | 7 | 2007 |
| 870 | 871 | 60.0 | 6600 | 1962 | 1962 | 0.0 | 0 | 0 | 894 | 894 | ... | 308 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 2009 |
| 92 | 93 | 80.0 | 13360 | 1921 | 2006 | 0.0 | 713 | 0 | 163 | 876 | ... | 432 | 0 | 0 | 44 | 0 | 0 | 0 | 0 | 8 | 2009 |
| 817 | 818 | 75.0 | 13265 | 2002 | 2002 | 148.0 | 1218 | 0 | 350 | 1568 | ... | 857 | 150 | 59 | 0 | 0 | 0 | 0 | 0 | 7 | 2008 |
| 302 | 303 | 118.0 | 13704 | 2001 | 2002 | 150.0 | 0 | 0 | 1541 | 1541 | ... | 843 | 468 | 81 | 0 | 0 | 0 | 0 | 0 | 1 | 2006 |
5 rows × 26 columns
# Now merge the numeric-only dataframe with the transformed categorical dataframe
all_num_cat_trans_train_X = pd.concat([num_train_X,ames_transform_all_cat_train],axis=1)
all_num_cat_trans_val_X = pd.concat([num_val_X,ames_transform_all_cat_val],axis=1)
# For test dataset
all_num_cat_trans_test = pd.concat([num_test,ames_transform_all_cat_test],axis=1)
all_num_cat_trans_test.shape
(1459, 223)
print(all_num_cat_trans_train_X.shape)
print(all_num_cat_trans_val_X.shape)
(1110, 223) (269, 223)
We are done with the categorical features. The number of features has increased drastically, which may impact model performance significantly, for better or worse. One thing is certain: all features are now numeric. They still have different ranges and distributions, so my next job is to make sure those differences do not hurt the model's performance.
In this part, I will run a Shapiro-Wilk normality test against only the continuous, "original-number" features. Features that do not follow a Gaussian distribution will then go through a normalization procedure using RobustScaler.
Side note: the Shapiro-Wilk test's hypotheses are:
H0: the population is normally distributed
H1: the population is not normally distributed
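A quick sanity check of how the test behaves (toy samples with a fixed seed): for a clearly skewed sample the p-value is effectively zero, so H0 is rejected.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=500)       # drawn from a Gaussian
skewed_sample = rng.exponential(size=500)  # clearly non-Gaussian

_, p_normal = shapiro(normal_sample)   # typically well above 0.05: fail to reject H0
_, p_skewed = shapiro(skewed_sample)   # effectively zero: reject H0
```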
# First of all, make 'Id' the dataframe index
all_num_cat_trans_train_X = all_num_cat_trans_train_X.set_index('Id')
all_num_cat_trans_val_X = all_num_cat_trans_val_X.set_index('Id')
# For test dataset
all_num_cat_trans_test = all_num_cat_trans_test.set_index('Id')
# Next, exclude the dummy and ordinal features from the standardization process
num_features = [col for col in all_num_cat_trans_train_X.columns if col in num_train_X]
len(num_features)
25
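The RobustScaler step mentioned above centers each feature on its median and scales by its IQR, so a single extreme outlier barely moves either statistic. A self-contained sketch on toy values (not the actual pipeline cell):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature with an extreme outlier (100)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# median = 3, IQR = 4 - 2 = 2, so scaled = (x - 3) / 2;
# the outlier does not distort the center or scale of the other points
scaled = RobustScaler().fit_transform(x)
```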
# import shapiro module
from scipy.stats import shapiro
# Perform the test
def ShapiroTest(df, sig):
    p_val = []
    decs = []
    cols = []
    for i in df.columns:
        stat, res = shapiro(df[i])  # shapiro returns (statistic, p-value)
        cols.append(i)
        p_val.append(res)
        if res > sig:
            decs.append("Normal Distribution")
        else:
            decs.append("Not Normal Distribution")
    res_df = pd.DataFrame({'Feature': cols, 'P-Value': p_val, 'Conclusion': decs})
    return res_df
Shapiro_test_train = ShapiroTest(all_num_cat_trans_train_X[num_features],0.05)
Shapiro_test_train
| Feature | P-Value | Conclusion | |
|---|---|---|---|
| 0 | LotFrontage | 3.162906e-26 | Not Normal Distribution |
| 1 | LotArea | 0.000000e+00 | Not Normal Distribution |
| 2 | YearBuilt | 1.070809e-23 | Not Normal Distribution |
| 3 | YearRemodAdd | 3.671455e-30 | Not Normal Distribution |
| 4 | MasVnrArea | 5.507103e-43 | Not Normal Distribution |
| 5 | BsmtFinSF1 | 2.696543e-27 | Not Normal Distribution |
| 6 | BsmtFinSF2 | 0.000000e+00 | Not Normal Distribution |
| 7 | BsmtUnfSF | 4.893147e-22 | Not Normal Distribution |
| 8 | TotalBsmtSF | 2.406882e-17 | Not Normal Distribution |
| 9 | 1stFlrSF | 3.416720e-19 | Not Normal Distribution |
| 10 | 2ndFlrSF | 1.329707e-36 | Not Normal Distribution |
| 11 | LowQualFinSF | 0.000000e+00 | Not Normal Distribution |
| 12 | GrLivArea | 4.646039e-20 | Not Normal Distribution |
| 13 | GarageYrBlt | 1.433124e-23 | Not Normal Distribution |
| 14 | GarageCars | 2.738633e-36 | Not Normal Distribution |
| 15 | GarageArea | 1.079923e-16 | Not Normal Distribution |
| 16 | WoodDeckSF | 1.312211e-36 | Not Normal Distribution |
| 17 | OpenPorchSF | 1.260988e-38 | Not Normal Distribution |
| 18 | EnclosedPorch | 0.000000e+00 | Not Normal Distribution |
| 19 | 3SsnPorch | 0.000000e+00 | Not Normal Distribution |
| 20 | ScreenPorch | 0.000000e+00 | Not Normal Distribution |
| 21 | PoolArea | 0.000000e+00 | Not Normal Distribution |
| 22 | MiscVal | 0.000000e+00 | Not Normal Distribution |
| 23 | MoSold | 3.222904e-15 | Not Normal Distribution |
| 24 | YrSold | 1.196740e-26 | Not Normal Distribution |
Shapiro_test_train['Conclusion'].value_counts() # none of the features follow a Gaussian distribution
Not Normal Distribution    25
Name: Conclusion, dtype: int64
Shapiro_test_val = ShapiroTest(all_num_cat_trans_val_X[num_features],0.05)
Shapiro_test_val
| Feature | P-Value | Conclusion | |
|---|---|---|---|
| 0 | LotFrontage | 3.379728e-18 | Not Normal Distribution |
| 1 | LotArea | 3.853450e-23 | Not Normal Distribution |
| 2 | YearBuilt | 7.989654e-11 | Not Normal Distribution |
| 3 | YearRemodAdd | 5.193416e-15 | Not Normal Distribution |
| 4 | MasVnrArea | 2.387008e-22 | Not Normal Distribution |
| 5 | BsmtFinSF1 | 2.171661e-20 | Not Normal Distribution |
| 6 | BsmtFinSF2 | 1.714818e-30 | Not Normal Distribution |
| 7 | BsmtUnfSF | 2.323560e-11 | Not Normal Distribution |
| 8 | TotalBsmtSF | 1.464569e-17 | Not Normal Distribution |
| 9 | 1stFlrSF | 3.761116e-15 | Not Normal Distribution |
| 10 | 2ndFlrSF | 1.013616e-19 | Not Normal Distribution |
| 11 | LowQualFinSF | 1.358333e-34 | Not Normal Distribution |
| 12 | GrLivArea | 1.549194e-15 | Not Normal Distribution |
| 13 | GarageYrBlt | 4.751121e-11 | Not Normal Distribution |
| 14 | GarageCars | 3.655624e-18 | Not Normal Distribution |
| 15 | GarageArea | 8.067395e-09 | Not Normal Distribution |
| 16 | WoodDeckSF | 2.535409e-18 | Not Normal Distribution |
| 17 | OpenPorchSF | 4.363552e-19 | Not Normal Distribution |
| 18 | EnclosedPorch | 6.642034e-30 | Not Normal Distribution |
| 19 | 3SsnPorch | 1.272099e-33 | Not Normal Distribution |
| 20 | ScreenPorch | 2.572267e-30 | Not Normal Distribution |
| 21 | PoolArea | 5.303774e-34 | Not Normal Distribution |
| 22 | MiscVal | 5.866475e-34 | Not Normal Distribution |
| 23 | MoSold | 3.800067e-05 | Not Normal Distribution |
| 24 | YrSold | 5.578459e-13 | Not Normal Distribution |
# Now we know that none of the numerical features, in either the train or validation dataset, come from
# a normal distribution, so I will scale all of them. But first, the year/date features should be set aside,
# because they are datetime-like and standardizing them in any form would make no sense.
year_feature = ['YearBuilt','YearRemodAdd','GarageYrBlt','MoSold','YrSold']
num_features = [col for col in num_features if col not in year_feature]
len(num_features)
20
# Distribution visuals: a histogram of each of the 20 remaining numeric features
plot_cols = ['LotFrontage','LotArea','MasVnrArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF',
             'TotalBsmtSF','1stFlrSF','2ndFlrSF','GrLivArea','GarageCars','GarageArea',
             'WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch',
             'PoolArea','MiscVal','LowQualFinSF']
figDis, axDis = plt.subplots(5, 4, figsize=(15, 15))
for ax, col in zip(axDis.flat, plot_cols):
    sns.histplot(ax=ax, x=col, data=all_num_cat_trans_train_X)
plt.subplots_adjust(wspace=0.4)
plt.show()
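Before applying RobustScaler, it helps to see exactly what it does: it subtracts each column's median and divides by the interquartile range, so a single extreme value cannot distort the centering and scale the way it would with mean/std scaling. A minimal sketch on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # one extreme outlier
scaled = RobustScaler().fit_transform(x)

# median = 3, IQR = Q3 - Q1 = 4 - 2 = 2, so scaled = (x - 3) / 2
print(scaled.ravel())  # -> [-1., -0.5, 0., 0.5, 48.5]
```

Note that the outlier is still an outlier after scaling; the scaler is robust *to* outliers, it does not remove them.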
# Scaling the predictor features (the target is scaled later)
from sklearn.preprocessing import RobustScaler
num_feature = num_features   # the 20 remaining numeric features
robustScaler = RobustScaler()
scaled_num_robust_train_X = pd.DataFrame(robustScaler.fit_transform(all_num_cat_trans_train_X[num_feature]),
                                         columns=num_feature)
scaled_num_robust_val_X = pd.DataFrame(robustScaler.transform(all_num_cat_trans_val_X[num_feature]),
                                       columns=num_feature)
# For the test dataset (transform only, with the scaler fitted on train)
scaled_num_robust_test = pd.DataFrame(robustScaler.transform(all_num_cat_trans_test[num_feature]),
                                      columns=num_feature)
# Align the index with the 'Id'-indexed source frame so the later concat lines up row by row
scaled_num_robust_test.index = all_num_cat_trans_test.index
scaled_num_robust_test
| LotFrontage | LotArea | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | GarageCars | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.35 | 0.513038 | 0.000000 | 0.094816 | 144.0 | -0.361218 | -0.260600 | -0.406386 | 0.000000 | 0.0 | -0.931292 | -1.0 | 1.219331 | 0.828402 | -0.405797 | 0.0 | 0.0 | 120.0 | 0.0 | 0.0 |
| 1 | 0.40 | 1.182658 | 0.622478 | 0.715553 | 0.0 | -0.127842 | 0.663909 | 0.431543 | 0.000000 | 0.0 | -0.251276 | -1.0 | -0.852540 | 2.325444 | 0.115942 | 0.0 | 0.0 | 0.0 | 0.0 | 12500.0 |
| 2 | 0.05 | 1.072025 | 0.000000 | 0.535471 | 0.0 | -0.589447 | -0.165460 | -0.344461 | 0.948579 | 0.0 | 0.219867 | 0.0 | -0.009913 | 1.254438 | 0.086957 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.25 | 0.096835 | 0.115274 | 0.277626 | 0.0 | -0.268554 | -0.169597 | -0.348331 | 0.917456 | 0.0 | 0.180605 | 0.0 | -0.069393 | 2.130178 | 0.115942 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | -1.50 | -1.162152 | 0.000000 | -0.184857 | 0.0 | 0.920635 | 0.562565 | 0.336720 | 0.000000 | 0.0 | -0.328229 | 0.0 | 0.109046 | 0.000000 | 0.782609 | 0.0 | 0.0 | 144.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1454 | -2.60 | -1.939114 | 0.000000 | -0.543656 | 0.0 | 0.112398 | -0.955533 | -1.083696 | 0.738836 | 0.0 | -0.623479 | -2.0 | -2.399009 | 0.000000 | -0.405797 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1455 | -2.60 | -1.949747 | 0.000000 | -0.199864 | 0.0 | -0.320034 | -0.955533 | -1.083696 | 0.738836 | 0.0 | -0.623479 | -1.0 | -0.981413 | 0.000000 | -0.057971 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1456 | 4.35 | 2.634051 | 0.000000 | 1.126194 | 0.0 | -0.824539 | 0.446743 | 0.228350 | 0.000000 | 0.0 | -0.416176 | 0.0 | 0.456010 | 2.804734 | -0.405797 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1457 | -0.55 | 0.214051 | 0.000000 | -0.083902 | 0.0 | 0.162162 | -0.198552 | -0.263183 | 0.000000 | 0.0 | -0.815077 | -2.0 | -2.399009 | 0.473373 | 0.057971 | 0.0 | 0.0 | 0.0 | 0.0 | 700.0 |
| 1458 | 0.05 | 0.007975 | 0.541787 | 0.490450 | 0.0 | -0.416130 | -0.024819 | -0.212869 | 1.358593 | 0.0 | 0.802513 | 1.0 | 0.822800 | 1.124260 | 0.289855 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1459 rows × 20 columns
all_num_cat_trans_train_X.shape
cat_features = [col for col in all_num_cat_trans_train_X.columns if col not in num_features]
len(cat_features) # 202 cat features
len(scaled_num_robust_train_X.columns) # 20 features
20
# Restore the 'Id' index on the scaled frames (fit_transform returned plain arrays)
scaled_num_robust_train_X.index = all_num_cat_trans_train_X.index
scaled_num_robust_val_X.index = all_num_cat_trans_val_X.index
# Keep only the categorical (dummy and ordinal) features in train and val
all_cat_trans_train_X = all_num_cat_trans_train_X[cat_features]
all_cat_trans_val_X = all_num_cat_trans_val_X[cat_features]
len(all_cat_trans_train_X.columns)
202
# For test dataset
all_cat_trans_test = all_num_cat_trans_test[cat_features]
len(all_cat_trans_test.columns)
202
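The upcoming `pd.concat(axis=1)` calls align rows by index, not by position, which is why the scaled frames were reindexed to match their 'Id'-indexed sources. A toy illustration of what goes wrong otherwise:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=[10, 11])
b = pd.DataFrame({'y': [3, 4]})            # default RangeIndex: 0, 1

misaligned = pd.concat([a, b], axis=1)     # union of indexes {0, 1, 10, 11} -> 4 rows with NaNs
b.index = a.index                          # align first...
aligned = pd.concat([a, b], axis=1)        # ...then the 2 rows line up as intended
```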
scaled_all_num_cat_robust_trainX = pd.concat([scaled_num_robust_train_X,all_cat_trans_train_X],axis=1)
scaled_all_num_cat_robust_valX = pd.concat([scaled_num_robust_val_X,all_cat_trans_val_X],axis=1)
scaled_all_num_cat_robust_trainX
| LotFrontage | LotArea | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | LowQualFinSF | ... | x25_ConLI | x25_ConLw | x25_New | x25_WD | x26_Abnorml | x26_AdjLand | x26_Alloca | x26_Family | x26_Normal | x26_Partial | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Id | |||||||||||||||||||||
| 619 | 0.85 | 0.531266 | 2.605187 | -0.478172 | 0.0 | 2.219648 | 1.683557 | 1.397194 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 871 | -0.65 | -0.758354 | 0.000000 | -0.543656 | 0.0 | 0.709567 | -0.235781 | -0.410256 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 93 | 0.35 | 0.953038 | 0.000000 | 0.429059 | 0.0 | -0.544831 | -0.273009 | -0.274794 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 818 | 0.10 | 0.928987 | 0.853026 | 1.118008 | 0.0 | -0.223938 | 1.158221 | 1.128205 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 303 | 2.25 | 1.040127 | 0.864553 | -0.543656 | 0.0 | 1.819820 | 1.102378 | 0.841800 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 764 | 0.45 | -0.041899 | 3.878963 | 1.042974 | 0.0 | -0.671815 | 0.504654 | 0.313498 | 1.484438 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 836 | -0.65 | 0.001139 | 0.000000 | 0.059345 | 0.0 | 0.247962 | 0.122027 | -0.075472 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1217 | -0.25 | -0.168481 | 0.000000 | -0.543656 | 0.0 | -0.824539 | -2.084798 | 0.410256 | 0.790257 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 560 | -1.45 | -1.620127 | 0.103746 | -0.543656 | 0.0 | 1.533248 | 0.756980 | 0.872762 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 685 | -0.75 | 1.816329 | 0.172911 | -0.543656 | 0.0 | 1.226083 | 0.386763 | 0.172230 | 0.871448 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1110 rows × 222 columns
# Concatenate for the test dataset
scaled_all_num_cat_robust_test = pd.concat([scaled_num_robust_test,all_cat_trans_test],axis=1)
scaled_all_num_cat_robust_test
| LotFrontage | LotArea | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | LowQualFinSF | ... | x25_ConLI | x25_ConLw | x25_New | x25_WD | x26_Abnorml | x26_AdjLand | x26_Alloca | x26_Family | x26_Normal | x26_Partial | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.35 | 0.513038 | 0.000000 | 0.094816 | 144.0 | -0.361218 | -0.260600 | -0.406386 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.40 | 1.182658 | 0.622478 | 0.715553 | 0.0 | -0.127842 | 0.663909 | 0.431543 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.05 | 1.072025 | 0.000000 | 0.535471 | 0.0 | -0.589447 | -0.165460 | -0.344461 | 0.948579 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.25 | 0.096835 | 0.115274 | 0.277626 | 0.0 | -0.268554 | -0.169597 | -0.348331 | 0.917456 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | -1.50 | -1.162152 | 0.000000 | -0.184857 | 0.0 | 0.920635 | 0.562565 | 0.336720 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1454 | -2.60 | -1.939114 | 0.000000 | -0.543656 | 0.0 | 0.112398 | -0.955533 | -1.083696 | 0.738836 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1455 | -2.60 | -1.949747 | 0.000000 | -0.199864 | 0.0 | -0.320034 | -0.955533 | -1.083696 | 0.738836 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1456 | 4.35 | 2.634051 | 0.000000 | 1.126194 | 0.0 | -0.824539 | 0.446743 | 0.228350 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1457 | -0.55 | 0.214051 | 0.000000 | -0.083902 | 0.0 | 0.162162 | -0.198552 | -0.263183 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1458 | 0.05 | 0.007975 | 0.541787 | 0.490450 | 0.0 | -0.416130 | -0.024819 | -0.212869 | 1.358593 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1459 rows × 222 columns
scaled_all_num_cat_robust_trainX
| LotFrontage | LotArea | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | LowQualFinSF | ... | x25_ConLI | x25_ConLw | x25_New | x25_WD | x26_Abnorml | x26_AdjLand | x26_Alloca | x26_Family | x26_Normal | x26_Partial | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Id | |||||||||||||||||||||
| 619 | 0.85 | 0.531266 | 2.605187 | -0.478172 | 0.0 | 2.219648 | 1.683557 | 1.397194 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 871 | -0.65 | -0.758354 | 0.000000 | -0.543656 | 0.0 | 0.709567 | -0.235781 | -0.410256 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 93 | 0.35 | 0.953038 | 0.000000 | 0.429059 | 0.0 | -0.544831 | -0.273009 | -0.274794 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 818 | 0.10 | 0.928987 | 0.853026 | 1.118008 | 0.0 | -0.223938 | 1.158221 | 1.128205 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 303 | 2.25 | 1.040127 | 0.864553 | -0.543656 | 0.0 | 1.819820 | 1.102378 | 0.841800 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 764 | 0.45 | -0.041899 | 3.878963 | 1.042974 | 0.0 | -0.671815 | 0.504654 | 0.313498 | 1.484438 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 836 | -0.65 | 0.001139 | 0.000000 | 0.059345 | 0.0 | 0.247962 | 0.122027 | -0.075472 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1217 | -0.25 | -0.168481 | 0.000000 | -0.543656 | 0.0 | -0.824539 | -2.084798 | 0.410256 | 0.790257 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 560 | -1.45 | -1.620127 | 0.103746 | -0.543656 | 0.0 | 1.533248 | 0.756980 | 0.872762 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 685 | -0.75 | 1.816329 | 0.172911 | -0.543656 | 0.0 | 1.226083 | 0.386763 | 0.172230 | 0.871448 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1110 rows × 222 columns
scaled_all_num_cat_robust_valX
| LotFrontage | LotArea | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | LowQualFinSF | ... | x25_ConLI | x25_ConLw | x25_New | x25_WD | x26_Abnorml | x26_AdjLand | x26_Alloca | x26_Family | x26_Normal | x26_Partial | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Id | |||||||||||||||||||||
| 530 | -0.10 | 5.841139 | 0.000000 | 1.119372 | 0.0 | 0.575719 | 2.124095 | 2.726657 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 492 | 0.30 | -0.026709 | 0.000000 | 0.006139 | 165.0 | -0.416130 | -0.417787 | -0.286405 | 0.838972 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 460 | 0.30 | -0.653291 | 0.927954 | -0.291269 | 0.0 | 0.074646 | -0.618407 | -0.245767 | 0.303112 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 280 | 0.50 | 0.103671 | 1.723343 | -0.008868 | 0.0 | 0.493350 | 0.314374 | 0.096759 | 1.171854 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 656 | -2.60 | -2.003924 | 2.195965 | -0.543656 | 0.0 | 0.076362 | -0.998966 | -1.124335 | 0.767253 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1030 | -2.60 | -2.003924 | 1.619597 | -0.111187 | 0.0 | -0.215358 | -0.694933 | -0.839865 | 0.738836 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 327 | -2.05 | 0.316582 | 0.000000 | 1.665075 | 0.0 | -0.652939 | 1.470527 | 1.186260 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 441 | 1.60 | 1.477342 | 1.152738 | 1.866985 | 539.0 | 0.527671 | 4.314374 | 2.507983 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1388 | -0.65 | -0.272278 | 0.000000 | -0.314461 | 546.0 | -0.824539 | -0.608066 | 1.079826 | 1.166441 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 62 | -0.65 | -0.606456 | 0.000000 | -0.543656 | 0.0 | 0.084942 | -0.988625 | -1.015965 | 0.717185 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
269 rows × 222 columns
plot_cols = ['LotFrontage','LotArea','MasVnrArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF',
             'TotalBsmtSF','1stFlrSF','2ndFlrSF','GrLivArea','GarageCars','GarageArea',
             'WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch',
             'PoolArea','MiscVal','LowQualFinSF']
figDis, axDis = plt.subplots(5, 4, figsize=(15, 15))
for ax, col in zip(axDis.flat, plot_cols):
    sns.histplot(ax=ax, x=col, data=scaled_all_num_cat_robust_trainX)
plt.subplots_adjust(wspace=0.4)
plt.show()
# Experiment: normalize the data using PowerTransformer (Yeo-Johnson)
from sklearn.preprocessing import PowerTransformer
powerTrans = PowerTransformer()
normalize_num_train_X = pd.DataFrame(powerTrans.fit_transform(all_num_cat_trans_train_X[num_features]),columns=num_features)
normalize_num_train_X.shape
(1110, 20)
plot_cols = ['LotFrontage','LotArea','MasVnrArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF',
             'TotalBsmtSF','1stFlrSF','2ndFlrSF','GrLivArea','GarageCars','GarageArea',
             'WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch',
             'PoolArea','MiscVal','LowQualFinSF']
figNorm, axNorm = plt.subplots(5, 4, figsize=(15, 15))
for ax, col in zip(axNorm.flat, plot_cols):
    sns.histplot(ax=ax, x=col, data=normalize_num_train_X)
plt.subplots_adjust(wspace=0.4)
plt.show()
# Almost every numeric feature's distribution has moved closer to a Gaussian.
ShapiroTest(normalize_num_train_X,0.05)
# Even so, none of them pass the normality test,
# so I decided to proceed with the analysis based only on the robust-scaled data.
| Feature | P-Value | Conclusion | |
|---|---|---|---|
| 0 | LotFrontage | 2.680891e-20 | Not Normal Distribution |
| 1 | LotArea | 1.316176e-26 | Not Normal Distribution |
| 2 | MasVnrArea | 1.129587e-41 | Not Normal Distribution |
| 3 | BsmtFinSF1 | 4.902530e-34 | Not Normal Distribution |
| 4 | BsmtFinSF2 | 0.000000e+00 | Not Normal Distribution |
| 5 | BsmtUnfSF | 2.855252e-11 | Not Normal Distribution |
| 6 | TotalBsmtSF | 3.500056e-17 | Not Normal Distribution |
| 7 | 1stFlrSF | 9.795268e-03 | Not Normal Distribution |
| 8 | 2ndFlrSF | 1.726400e-42 | Not Normal Distribution |
| 9 | LowQualFinSF | 0.000000e+00 | Not Normal Distribution |
| 10 | GrLivArea | 8.524834e-03 | Not Normal Distribution |
| 11 | GarageCars | 2.182170e-36 | Not Normal Distribution |
| 12 | GarageArea | 8.502025e-07 | Not Normal Distribution |
| 13 | WoodDeckSF | 3.241063e-40 | Not Normal Distribution |
| 14 | OpenPorchSF | 2.072875e-35 | Not Normal Distribution |
| 15 | EnclosedPorch | 0.000000e+00 | Not Normal Distribution |
| 16 | 3SsnPorch | 0.000000e+00 | Not Normal Distribution |
| 17 | ScreenPorch | 0.000000e+00 | Not Normal Distribution |
| 18 | PoolArea | 0.000000e+00 | Not Normal Distribution |
| 19 | MiscVal | 0.000000e+00 | Not Normal Distribution |
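To make the "closer to Gaussian" observation concrete, here is a small synthetic check (assumed data, not the Ames features): PowerTransformer's Yeo-Johnson transform sharply reduces the skewness of a right-skewed sample, even though the result may still fail a strict normality test, exactly as happened above.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
x = np.exp(rng.normal(size=1000)).reshape(-1, 1)   # log-normal: strongly right-skewed

x_t = PowerTransformer(method='yeo-johnson').fit_transform(x)

# skewness shrinks toward 0 after the transform
print(skew(x.ravel()), skew(x_t.ravel()))
```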
# Transform the target feature (SalePrice)
new_train_y = train_y.values.reshape(-1,1)
new_val_y = val_y.values.reshape(-1,1)
robustScaler_forPrice = RobustScaler()
scaled_train_y = pd.DataFrame(robustScaler_forPrice.fit_transform(new_train_y),columns=['SalePrice'])
scaled_val_y = pd.DataFrame(robustScaler_forPrice.transform(new_val_y),columns=['SalePrice'])
# The original positional index is 0-based while 'Id' starts at 1, so shift by one to recover Id
scaled_train_y.index = train_y.index+1
scaled_val_y.index = val_y.index+1
scaled_train_y.head()
| SalePrice | |
|---|---|
| 619 | 1.786035 |
| 871 | -0.629412 |
| 93 | 0.005882 |
| 818 | 1.270588 |
| 303 | 0.494118 |
scaled_val_y.head()
| SalePrice | |
|---|---|
| 530 | 0.442635 |
| 492 | -0.352941 |
| 460 | -0.623529 |
| 280 | 0.341176 |
| 656 | -0.882353 |
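One practical consequence of scaling the target: any model trained on scaled_train_y predicts in scaled units, so predictions must be mapped back to dollars with the same fitted scaler. A minimal sketch with made-up prices:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

prices = np.array([[120000.0], [163000.0], [208500.0], [340000.0]])
scaler = RobustScaler().fit(prices)

scaled = scaler.transform(prices)            # what the model would see as its target
restored = scaler.inverse_transform(scaled)  # back to the original dollar scale

print(np.allclose(restored, prices))  # True
```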
scaled_all_num_cat_robust_trainX.index
Int64Index([ 619, 871, 93, 818, 303, 1455, 41, 960, 76, 1390,
...
1095, 600, 278, 1034, 1384, 764, 836, 1217, 560, 685],
dtype='int64', name='Id', length=1110)
# Reorder the scaled train y to match the shuffled train X index
scaled_train_y = scaled_train_y.loc[scaled_all_num_cat_robust_trainX.index]
scaled_train_y
| SalePrice | |
|---|---|
| Id | |
| 619 | 1.786035 |
| 871 | -0.629412 |
| 93 | 0.005882 |
| 818 | 1.270588 |
| 303 | 0.494118 |
| ... | ... |
| 764 | 2.047059 |
| 836 | -0.411765 |
| 1217 | -0.600000 |
| 560 | 0.835294 |
| 685 | 0.682353 |
1110 rows × 1 columns
scaled_all_num_cat_robust_valX.index
Int64Index([ 530, 492, 460, 280, 656, 1014, 1404, 602, 1183, 688,
...
1360, 1302, 1428, 1403, 1224, 1030, 327, 441, 1388, 62],
dtype='int64', name='Id', length=269)
# Reorder the scaled validation y to match the validation X index
scaled_val_y = scaled_val_y.loc[scaled_all_num_cat_robust_valX.index]
scaled_val_y
| SalePrice | |
|---|---|
| Id | |
| 530 | 0.442635 |
| 492 | -0.352941 |
| 460 | -0.623529 |
| 280 | 0.341176 |
| 656 | -0.882353 |
| ... | ... |
| 1030 | -0.529412 |
| 327 | 1.894118 |
| 441 | 4.611765 |
| 1388 | -0.317647 |
| 62 | -0.729412 |
269 rows × 1 columns
# Outliers
# I won't do any outlier detection at this point.
# The reason: I have already scaled the data, and outlier detection after this stage tends to flag
# nearly every row as an outlier. To see this, run the following code and compare the results
# of outlier detection on the scaled versus the original dataset.
## NOTE: This is not a general rule, but it would probably be more meaningful to detect and treat outliers
## before all the preprocessing steps. Each case and each transformation method has a different impact.
## Uncomment one of these parts
###############################################
##### Outlier detection on the scaled dataset
#Q1 = scaled_all_num_cat_robust_trainX.quantile(0.25)
#Q3 = scaled_all_num_cat_robust_trainX.quantile(0.75)
# calculate the IQR
#IQR = Q3 - Q1
# filter the dataset with the IQR rule
#IQR_outliers = scaled_all_num_cat_robust_trainX[((scaled_all_num_cat_robust_trainX < (Q1 - 1.5 * IQR)) | (scaled_all_num_cat_robust_trainX > (Q3 + 1.5 * IQR))).any(axis=1)]
#IQR_outliers
###############################################
##### Outlier detection on the original dataset
#Q1 = train_X.quantile(0.25)
#Q3 = train_X.quantile(0.75)
# calculate the IQR
#IQR = Q3 - Q1
# filter the dataset with the IQR rule
#IQR_outliers = train_X[((train_X < (Q1 - 1.5 * IQR)) | (train_X > (Q3 + 1.5 * IQR))).any(axis=1)]
#IQR_outliers
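For reference, the IQR rule used above flags any value outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR. A tiny runnable example on a made-up column:

```python
import pandas as pd

s = pd.Series([40, 41, 42, 43, 45, 400], name='area')  # one obvious outlier

Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
outliers = s[(s < Q1 - 1.5 * IQR) | (s > Q3 + 1.5 * IQR)]

print(outliers.tolist())  # [400]
```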
The Ames Housing dataset now has 222 features available for model building. While having a lot of data is a blessing, it is also a curse: more features mean a higher cost of collecting and maintaining data, a heavier load on our machines, and longer model training times.
Not all features need to be included in the model, and some important predictors can only be obtained by processing and combining existing features into new ones.
In this section, I will apply three different feature selection methods.
For a detailed explanation of each method, please consult more authoritative sources; personally, I recommend the articles on analyticsvidhya.com and kdnuggets.com.
Before running the selection process, I will do some feature engineering guided by intuition. If everything turns out as expected, I will then proceed with feature selection using all the methods mentioned above.
At the end of this section, I will have three different subsets of data ready to be trained with a model.
# Feature Engineering
# Year Feature Vs SalePrice
figYr,axYr = plt.subplots(2,2, figsize=(10,8))
sns.lineplot(ax=axYr[0,0],x=scaled_all_num_cat_robust_trainX['YearBuilt'], y = scaled_train_y['SalePrice'])
sns.lineplot(ax=axYr[0,1],x=scaled_all_num_cat_robust_trainX['YearRemodAdd'],y=scaled_train_y['SalePrice'])
sns.lineplot(ax=axYr[1,0],x=scaled_all_num_cat_robust_trainX['MoSold'], y = scaled_train_y['SalePrice'])
sns.lineplot(ax=axYr[1,1],x=scaled_all_num_cat_robust_trainX['YrSold'],y=scaled_train_y['SalePrice'])
plt.show()
SalePrice shows an increasing trend against YearBuilt and YearRemodAdd. Other factors, such as the general price level, should ideally be considered, but given the limitations of the data I decided to use the nominal price as the basis of my analysis. Future research could add an inflation feature so the real-price effect can be examined.
Support for this decision comes from Selim, H. (2009), who analyzed the determinants of house prices in Turkey. Selim used features similar to those in the Ames, Iowa dataset to break down how housing prices are determined, applying and comparing predictions from hedonic regression and an artificial neural network. Selim (2009) also used the nominal selling price as the target feature, transformed with the natural logarithm.
By contrast, SalePrice seems to show a decreasing trend against YrSold. This may be connected to the fact that sales records are only available from 2006 to 2010, a period around the 2008-2009 global financial crisis triggered by subprime mortgages. This short window (four years of records) is rather unbalanced compared to YearBuilt or YearRemodAdd, so I probably won't use YrSold as a house-price determinant, given this reasoning and the significant events of that period.
Rather than using the year features as they are, I will use the age of the house. House age has been used as a determinant feature and shown to be statistically significant in several papers, such as Selim, H. (2009) and Xu, Y., Zhang, Q., Zheng, S., et al. (2018).
I will generate two features, age based on the building year and age based on the remodelling year, to see the difference and the impact of each.
# engineering AgeBuilt
scaled_all_num_cat_robust_trainX['AgeBuilt'] = scaled_all_num_cat_robust_trainX['YrSold'] - scaled_all_num_cat_robust_trainX['YearBuilt']
# engineering AgeRemodAdd
scaled_all_num_cat_robust_trainX['AgeRemodAdd'] = scaled_all_num_cat_robust_trainX['YrSold'] - scaled_all_num_cat_robust_trainX['YearRemodAdd']
figAge, axAge = plt.subplots(1,2,figsize=(10,5))
sns.lineplot(ax=axAge[0],x=scaled_all_num_cat_robust_trainX['AgeBuilt'],y=scaled_train_y['SalePrice'])
sns.lineplot(ax=axAge[1],x=scaled_all_num_cat_robust_trainX['AgeRemodAdd'],y=scaled_train_y['SalePrice'])
plt.show()
# Generate the same features in validation data
# engineering AgeBuilt
scaled_all_num_cat_robust_valX['AgeBuilt'] = scaled_all_num_cat_robust_valX['YrSold'] - scaled_all_num_cat_robust_valX['YearBuilt']
# engineering AgeRemodAdd
scaled_all_num_cat_robust_valX['AgeRemodAdd'] = scaled_all_num_cat_robust_valX['YrSold'] - scaled_all_num_cat_robust_valX['YearRemodAdd']
# For the test dataset
scaled_all_num_cat_robust_test['AgeBuilt'] = scaled_all_num_cat_robust_test['YrSold'] - scaled_all_num_cat_robust_test['YearBuilt']
scaled_all_num_cat_robust_test['AgeRemodAdd'] = scaled_all_num_cat_robust_test['YrSold'] - scaled_all_num_cat_robust_test['YearRemodAdd']
figAge, axAge = plt.subplots(1,2,figsize=(10,5))
sns.lineplot(ax=axAge[0],x=scaled_all_num_cat_robust_valX['AgeBuilt'],y=scaled_val_y['SalePrice'])
sns.lineplot(ax=axAge[1],x=scaled_all_num_cat_robust_valX['AgeRemodAdd'],y=scaled_val_y['SalePrice'])
plt.show()
# Scaling Age Features
robustScaler_forAge = RobustScaler()
scaled_age_train_X = pd.DataFrame(robustScaler_forAge.fit_transform(scaled_all_num_cat_robust_trainX[['AgeBuilt','AgeRemodAdd']]),
columns=['AgeBuilt','AgeRemodAdd'])
scaled_age_val_X = pd.DataFrame(robustScaler_forAge.transform(scaled_all_num_cat_robust_valX[['AgeBuilt','AgeRemodAdd']]),
columns=['AgeBuilt','AgeRemodAdd'])
# Scaling age features for test dataset
scaled_age_test = pd.DataFrame(robustScaler_forAge.transform(scaled_all_num_cat_robust_test[['AgeBuilt','AgeRemodAdd']]),
columns=['AgeBuilt','AgeRemodAdd'])
# Index
scaled_age_train_X.index = scaled_all_num_cat_robust_trainX.index
scaled_age_val_X.index = scaled_all_num_cat_robust_valX.index
scaled_age_train_X
| AgeBuilt | AgeRemodAdd | |
|---|---|---|
| Id | ||
| 619 | -0.717391 | -0.388889 |
| 871 | 0.304348 | 0.916667 |
| 93 | 1.195652 | -0.305556 |
| 818 | -0.586957 | -0.222222 |
| 303 | -0.608696 | -0.277778 |
| ... | ... | ... |
| 764 | -0.500000 | -0.111111 |
| 836 | 0.586957 | 0.027778 |
| 1217 | -0.021739 | 0.500000 |
| 560 | -0.652174 | -0.333333 |
| 685 | -0.456522 | -0.055556 |
1110 rows × 2 columns
scaled_age_test.index = scaled_all_num_cat_robust_test.index
scaled_age_test
| AgeBuilt | AgeRemodAdd | |
|---|---|---|
| 0 | 0.347826 | 0.972222 |
| 1 | 0.413043 | 1.055556 |
| 2 | -0.434783 | -0.027778 |
| 3 | -0.456522 | -0.055556 |
| 4 | -0.326087 | 0.111111 |
| ... | ... | ... |
| 1454 | 0.065217 | 0.611111 |
| 1455 | 0.065217 | 0.611111 |
| 1456 | 0.282609 | 0.888889 |
| 1457 | -0.413043 | 0.000000 |
| 1458 | -0.434783 | -0.027778 |
1459 rows × 2 columns
# drop the unscaled AgeBuilt and AgeRemodAdd columns
scaled_all_num_cat_robust_trainX = scaled_all_num_cat_robust_trainX.drop(['AgeBuilt','AgeRemodAdd'],axis=1)
scaled_all_num_cat_robust_valX = scaled_all_num_cat_robust_valX.drop(['AgeBuilt','AgeRemodAdd'],axis=1)
scaled_all_num_cat_robust_trainX
| LotFrontage | LotArea | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | LowQualFinSF | ... | x25_ConLI | x25_ConLw | x25_New | x25_WD | x26_Abnorml | x26_AdjLand | x26_Alloca | x26_Family | x26_Normal | x26_Partial | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Id | |||||||||||||||||||||
| 619 | 0.85 | 0.531266 | 2.605187 | -0.478172 | 0.0 | 2.219648 | 1.683557 | 1.397194 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 871 | -0.65 | -0.758354 | 0.000000 | -0.543656 | 0.0 | 0.709567 | -0.235781 | -0.410256 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 93 | 0.35 | 0.953038 | 0.000000 | 0.429059 | 0.0 | -0.544831 | -0.273009 | -0.274794 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 818 | 0.10 | 0.928987 | 0.853026 | 1.118008 | 0.0 | -0.223938 | 1.158221 | 1.128205 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 303 | 2.25 | 1.040127 | 0.864553 | -0.543656 | 0.0 | 1.819820 | 1.102378 | 0.841800 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 764 | 0.45 | -0.041899 | 3.878963 | 1.042974 | 0.0 | -0.671815 | 0.504654 | 0.313498 | 1.484438 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 836 | -0.65 | 0.001139 | 0.000000 | 0.059345 | 0.0 | 0.247962 | 0.122027 | -0.075472 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1217 | -0.25 | -0.168481 | 0.000000 | -0.543656 | 0.0 | -0.824539 | -2.084798 | 0.410256 | 0.790257 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 560 | -1.45 | -1.620127 | 0.103746 | -0.543656 | 0.0 | 1.533248 | 0.756980 | 0.872762 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 685 | -0.75 | 1.816329 | 0.172911 | -0.543656 | 0.0 | 1.226083 | 0.386763 | 0.172230 | 0.871448 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1110 rows × 222 columns
# For test dataset
scaled_all_num_cat_robust_test = scaled_all_num_cat_robust_test.drop(['AgeBuilt','AgeRemodAdd'],axis=1)
# concat the scaled features
scaled_all_num_cat_robust_trainX = pd.concat([scaled_all_num_cat_robust_trainX,scaled_age_train_X],axis=1)
scaled_all_num_cat_robust_valX = pd.concat([scaled_all_num_cat_robust_valX,scaled_age_val_X],axis=1)
scaled_all_num_cat_robust_trainX
| LotFrontage | LotArea | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | LowQualFinSF | ... | x25_New | x25_WD | x26_Abnorml | x26_AdjLand | x26_Alloca | x26_Family | x26_Normal | x26_Partial | AgeBuilt | AgeRemodAdd | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Id | |||||||||||||||||||||
| 619 | 0.85 | 0.531266 | 2.605187 | -0.478172 | 0.0 | 2.219648 | 1.683557 | 1.397194 | 0.000000 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | -0.717391 | -0.388889 |
| 871 | -0.65 | -0.758354 | 0.000000 | -0.543656 | 0.0 | 0.709567 | -0.235781 | -0.410256 | 0.000000 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.304348 | 0.916667 |
| 93 | 0.35 | 0.953038 | 0.000000 | 0.429059 | 0.0 | -0.544831 | -0.273009 | -0.274794 | 0.000000 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.195652 | -0.305556 |
| 818 | 0.10 | 0.928987 | 0.853026 | 1.118008 | 0.0 | -0.223938 | 1.158221 | 1.128205 | 0.000000 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.586957 | -0.222222 |
| 303 | 2.25 | 1.040127 | 0.864553 | -0.543656 | 0.0 | 1.819820 | 1.102378 | 0.841800 | 0.000000 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.608696 | -0.277778 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 764 | 0.45 | -0.041899 | 3.878963 | 1.042974 | 0.0 | -0.671815 | 0.504654 | 0.313498 | 1.484438 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.500000 | -0.111111 |
| 836 | -0.65 | 0.001139 | 0.000000 | 0.059345 | 0.0 | 0.247962 | 0.122027 | -0.075472 | 0.000000 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.586957 | 0.027778 |
| 1217 | -0.25 | -0.168481 | 0.000000 | -0.543656 | 0.0 | -0.824539 | -2.084798 | 0.410256 | 0.790257 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.021739 | 0.500000 |
| 560 | -1.45 | -1.620127 | 0.103746 | -0.543656 | 0.0 | 1.533248 | 0.756980 | 0.872762 | 0.000000 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.652174 | -0.333333 |
| 685 | -0.75 | 1.816329 | 0.172911 | -0.543656 | 0.0 | 1.226083 | 0.386763 | 0.172230 | 0.871448 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.456522 | -0.055556 |
1110 rows × 224 columns
# For test dataset
scaled_all_num_cat_robust_test = pd.concat([scaled_all_num_cat_robust_test,scaled_age_test],axis=1)
scaled_all_num_cat_robust_test.shape
(1459, 224)
# Feature Engineering Using KMeans
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=6) # 6 is arbitrary; we can experiment with different cluster counts
# For trainX: fit the clusters on the training data only
scaled_all_num_cat_robust_trainX['Cluster'] = kmeans.fit_predict(scaled_all_num_cat_robust_trainX)
scaled_all_num_cat_robust_trainX['Cluster'] = scaled_all_num_cat_robust_trainX['Cluster'].astype('category')
# For valX: reuse the fitted clusters (predict, not fit_predict) so validation rows are assigned to the training clusters
scaled_all_num_cat_robust_valX['Cluster'] = kmeans.predict(scaled_all_num_cat_robust_valX)
# For test: likewise, assign each row to its nearest training cluster
scaled_all_num_cat_robust_test['Cluster'] = kmeans.predict(scaled_all_num_cat_robust_test)
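Since the cluster count of 6 is arbitrary, one common way to compare candidate values of k is the elbow method. A minimal sketch on synthetic data (the array `X` here is a stand-in, not the real scaled training matrix):

```python
# Sketch: comparing candidate cluster counts with the elbow method.
# X is synthetic stand-in data, not the real scaled training matrix.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

inertias = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# Inertia shrinks as k grows; pick k near the "elbow" where improvement flattens.
for k, inertia in zip(range(2, 11), inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against k (e.g. with `plt.plot`) makes the elbow easier to spot; a silhouette score would be another option.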
# Feature Selection
cat_features.append('Cluster')
# Mutual Information
# Mutual information treats discrete features differently from continuous ones, so first we need to flag which is which
disc_features = scaled_all_num_cat_robust_trainX.columns.isin(cat_features).tolist()
# In our case, scaling has turned every value into float64, so I flag the discrete features using the
# cat_features and num_features lists I created earlier
disc_features
[False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, True]
# Compute mutual information scores
from sklearn.feature_selection import mutual_info_regression
def make_mi_scores(X, y, discrete_features):
mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)
mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
mi_scores = mi_scores.sort_values(ascending=False)
return mi_scores
ames_mi_score = make_mi_scores(scaled_all_num_cat_robust_trainX,scaled_train_y['SalePrice'],discrete_features=disc_features)
# Top 10 related features
ames_mi_score.head(10)
OverallQual     0.541482
GrLivArea       0.486116
YearBuilt       0.403426
TotalBsmtSF     0.402351
GarageCars      0.350771
GarageArea      0.349153
AgeBuilt        0.337192
BsmtQual        0.324694
1stFlrSF        0.318102
GarageYrBlt     0.314362
Name: MI Scores, dtype: float64
# Bottom 10 related features
ames_mi_score.tail(10)
x11_2.5Fin    0.0
x11_1Story    0.0
x18_GasW      0.0
x10_TwnhsE    0.0
x2_IR3        0.0
x21_Min1      0.0
x9_Norm       0.0
x21_Mod       0.0
x1_Pave       0.0
Cluster       0.0
Name: MI Scores, dtype: float64
# To make comparison easier, here is a bar plot of the top 50 MI scores
plt.figure(figsize=(20,15))
sns.barplot(y=ames_mi_score.index[:50],x=ames_mi_score[:50])
plt.show()
# Filter out features with mutual information below a threshold; 0.3 is arbitrary
highest_mi_score = ames_mi_score[ames_mi_score > 0.3].index.tolist()
mi_all_num_cat_robust_trainX = scaled_all_num_cat_robust_trainX[highest_mi_score]
mi_all_num_cat_robust_valX = scaled_all_num_cat_robust_valX[highest_mi_score]
mi_all_num_cat_robust_trainX
| OverallQual | GrLivArea | YearBuilt | TotalBsmtSF | GarageCars | GarageArea | AgeBuilt | BsmtQual | 1stFlrSF | GarageYrBlt | ExterQual | KitchenQual | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Id | ||||||||||||
| 619 | 7.0 | 0.532391 | 2007 | 1.683557 | 1.0 | 1.437423 | -0.717391 | 0.0 | 1.397194 | 2007.0 | 0.0 | 2.0 |
| 871 | 3.0 | -0.934433 | 1962 | -0.235781 | -1.0 | -0.872367 | 0.304348 | 4.0 | -0.410256 | 1962.0 | 3.0 | 3.0 |
| 93 | 3.0 | -0.824499 | 1921 | -0.273009 | 0.0 | -0.257745 | 1.195652 | 2.0 | -0.274794 | 1921.0 | 3.0 | 3.0 |
| 818 | 6.0 | 0.314095 | 2002 | 1.158221 | 1.0 | 1.848823 | -0.586957 | 2.0 | 1.128205 | 2002.0 | 2.0 | 2.0 |
| 303 | 5.0 | 0.081665 | 2001 | 1.102378 | 1.0 | 1.779430 | -0.608696 | 2.0 | 0.841800 | 2001.0 | 2.0 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 764 | 6.0 | 1.375736 | 1999 | 0.504654 | 1.0 | 1.843866 | -0.500000 | 2.0 | 0.313498 | 1999.0 | 2.0 | 2.0 |
| 836 | 2.0 | -0.662740 | 1950 | 0.122027 | 0.0 | -0.237918 | 0.586957 | 2.0 | -0.075472 | 1996.0 | 3.0 | 2.0 |
| 1217 | 4.0 | 0.648606 | 1978 | -2.084798 | 0.0 | 0.272615 | -0.021739 | 3.0 | 0.410256 | 1978.0 | 3.0 | 3.0 |
| 560 | 5.0 | 0.106792 | 2003 | 0.756980 | 0.0 | -0.317224 | -0.652174 | 2.0 | 0.872762 | 2003.0 | 2.0 | 2.0 |
| 685 | 5.0 | 0.549666 | 1998 | 0.386763 | 0.0 | 0.009913 | -0.456522 | 2.0 | 0.172230 | 1998.0 | 2.0 | 3.0 |
1110 rows × 12 columns
# One thing I notice immediately: no location feature (e.g. Neighborhood) appears in the top 10 related features.
# I will visualize it to check whether there really is no strong correlation between Neighborhood and SalePrice
sns.boxplot(x='Neighborhood',y='SalePrice',data=ameHouse)
plt.show()
# Based on some research about real estate, Neighborhood should play a significant role here, but my mutual
# information scores say the opposite.
# So I decide to do a comparison: a model built with the top mutual-information features without Neighborhood,
# versus the same features plus all the Neighborhood dummy components.
# But before that, I will continue with another feature selection procedure.
# Manual statistical tests
# The appropriate significance test depends on the distribution of the sample values
plt.figure(figsize=(15,15))
sns.heatmap(scaled_all_num_cat_robust_trainX.corr())
plt.show()
# Spearman's rank Correlation with response feature
from scipy.stats import spearmanr
rhols = []
pls = []
for col in scaled_all_num_cat_robust_trainX.columns:
rho, p = spearmanr(scaled_all_num_cat_robust_trainX[col], scaled_train_y['SalePrice'])
rhols.append(rho)
pls.append(p)
# Convert it to dataframe
sign = []
for i in pls:
if i < 0.05:
sign.append('Significant')
else:
sign.append('Not significant')
spear_corr = pd.DataFrame({
'Feature' : scaled_all_num_cat_robust_trainX.columns,
'Corr' : rhols,
'p-value': pls,
'Significance': sign
})
# show the number of significant feature
len(spear_corr.loc[spear_corr['Significance']=='Significant'])
146
# 146 out of 225 features prove to be statistically significantly correlated with SalePrice
# First, let's drop all insignificant features
sig_features = spear_corr.loc[spear_corr['Significance']=='Significant']['Feature'].tolist()
sig_all_num_cat_robust_trainX = scaled_all_num_cat_robust_trainX[sig_features]
sig_all_num_cat_robust_valX = scaled_all_num_cat_robust_valX[sig_features]
# I will use these 146 features as model building components.
# But now I am curious: what if, instead of using all of them, I only use features with correlation above a
# certain threshold? For this experiment I will use the median of the absolute correlations as the threshold;
# the Shapiro-Wilk p-value below checks whether the correlations are normally distributed (if not, the median
# is the more robust summary than the mean)
from scipy.stats import shapiro
shapiro(spear_corr.loc[spear_corr['Significance']=='Significant']['Corr'])[1]
threshold_corr = spear_corr.loc[spear_corr['Significance']=='Significant']['Corr'].abs().median()
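As a hedged aside, the decision that the Shapiro-Wilk check supports can be sketched on synthetic data (`corrs` below is an illustrative stand-in for the significant correlation values, not the real ones):

```python
# Sketch: pick the summary statistic for the threshold based on normality.
# `corrs` is a synthetic, skewed stand-in for the significant |rho| values.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
corrs = rng.beta(2, 5, size=150)

p_value = shapiro(corrs).pvalue
# If normality is rejected (p < 0.05), the median is the more robust center.
threshold = np.median(corrs) if p_value < 0.05 else float(corrs.mean())
print(round(float(threshold), 3))
```

For skewed distributions the mean is pulled toward the tail, so the median gives a more representative cutoff.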
# keep only features whose absolute correlation is above the median absolute correlation
sig_high_features = spear_corr[(spear_corr['Corr'].abs() > threshold_corr) & (spear_corr['Significance']=='Significant')]['Feature'].tolist()
sig_high_num_cat_robust_trainX = scaled_all_num_cat_robust_trainX[sig_high_features]
sig_high_num_cat_robust_valX = scaled_all_num_cat_robust_valX[sig_high_features]
sig_high_num_cat_robust_trainX.shape
sig_high_num_cat_robust_valX.shape
(269, 43)
# Now I have four feature sets to train and compare: the baseline, the highest-MI features,
# the all-significant Spearman features, and the highly correlated significant features
In this section, I'll build prediction models using the following algorithms: Random Forest Regressor and XGBoost Regressor.
I will train and validate model quality using cross-validation, then choose the model with the lowest error as the best predictor.
# Function to score model
from sklearn.model_selection import cross_val_score
def score_model(X,y,model):
score= -1 * cross_val_score(model,X,y,cv=5,scoring="neg_mean_absolute_error")
score_mean = score.mean()
return score_mean
# For scoring purposes, I will concatenate the train and val datasets,
# because cross-validation splits all observations into n folds, trains and predicts on each split,
# and returns the average accuracy across the held-out subsets
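A minimal illustration of what `cross_val_score` returns, on synthetic data (with `cv=5` on a regressor, sklearn uses plain `KFold`, which splits the rows into consecutive folds and returns one score per held-out fold):

```python
# Sketch: 5-fold cross-validation on synthetic regression data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# One MAE per held-out fold; negate because sklearn maximizes scores.
maes = -1 * cross_val_score(LinearRegression(), X, y, cv=5,
                            scoring="neg_mean_absolute_error")
print(maes.shape)        # (5,)
print(round(maes.mean(), 3))
```

Averaging the five fold scores is exactly what `score_model` above does for the real datasets.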
# import Random Forest Regressor Module
from sklearn.ensemble import RandomForestRegressor
# import XGBRegressor Module
from xgboost import XGBRegressor
# define the models
forest_model = RandomForestRegressor(n_estimators=200, random_state=0)
xgboost_model = XGBRegressor(n_estimators=500, learning_rate=0.01, random_state=0)
#baseline dataset
concat_scaled_all_num_cat_robust = pd.concat([scaled_all_num_cat_robust_trainX,scaled_all_num_cat_robust_valX],axis=0)
concat_scaled_all_num_cat_robust.shape
(1379, 225)
# Highest MI dataset
concat_mi_all_num_cat_robust = pd.concat([mi_all_num_cat_robust_trainX,mi_all_num_cat_robust_valX],axis=0)
concat_mi_all_num_cat_robust.shape
(1379, 12)
# All Significant Dataset
concat_sig_all_num_cat_robust = pd.concat([sig_all_num_cat_robust_trainX,sig_all_num_cat_robust_valX],axis=0)
concat_sig_all_num_cat_robust.shape
(1379, 146)
# High Correlated Significant Dataset
concat_sig_high_num_cat_robust = pd.concat([sig_high_num_cat_robust_trainX,sig_high_num_cat_robust_valX],axis=0)
concat_sig_high_num_cat_robust.shape
(1379, 43)
concat_scaled_y = pd.concat([scaled_train_y,scaled_val_y],axis=0)
concat_scaled_y.shape
(1379, 1)
print("Mean Absolute Error of baseline dataset in Random Forest Model: {} ".format(score_model(concat_scaled_all_num_cat_robust,
concat_scaled_y['SalePrice'],forest_model)))
Mean Absolute Error of baseline dataset in Random Forest Model: 0.2092848962723398
print("Mean Absolute Error of baseline dataset in XGBoost Model: {} ".format(score_model(concat_scaled_all_num_cat_robust,
concat_scaled_y['SalePrice'],xgboost_model)))
Mean Absolute Error of baseline dataset in XGBoost Model: 0.1987541564657423
print("Mean Absolute Error of MI_highest dataset in Random Forest Model: {} ".format(score_model(concat_mi_all_num_cat_robust,
concat_scaled_y['SalePrice'],forest_model)))
Mean Absolute Error of MI_highest dataset in Random Forest Model: 0.23560130661637196
print("Mean Absolute Error of MI_highest dataset in XGBoost Model: {} ".format(score_model(concat_mi_all_num_cat_robust,
concat_scaled_y['SalePrice'],xgboost_model)))
Mean Absolute Error of MI_highest dataset in XGBoost Model: 0.2330302579651387
print("Mean Absolute Error of all Significant dataset in Random Forest Model: {} ".format(score_model(concat_sig_all_num_cat_robust,
concat_scaled_y['SalePrice'],forest_model)))
Mean Absolute Error of all Significant dataset in Random Forest Model: 0.20750087976532589
print("Mean Absolute Error of all Significant dataset in XGBoost Model: {} ".format(score_model(concat_sig_all_num_cat_robust,
concat_scaled_y['SalePrice'],xgboost_model)))
Mean Absolute Error of all Significant dataset in XGBoost Model: 0.19794070810944578
print("Mean Absolute Error of High Significant dataset in Random Forest Model: {} ".format(score_model(concat_sig_high_num_cat_robust,
concat_scaled_y['SalePrice'],forest_model)))
Mean Absolute Error of High Significant dataset in Random Forest Model: 0.2123408166365283
print("Mean Absolute Error of High Significant dataset in XGBoost Model: {} ".format(score_model(concat_sig_high_num_cat_robust,
concat_scaled_y['SalePrice'],xgboost_model)))
Mean Absolute Error of High Significant dataset in XGBoost Model: 0.21109059713462117
# Create test dataset with all significant features only
sig_scaled_all_num_cat_robust_test = scaled_all_num_cat_robust_test[sig_features]
# cast Cluster back from 'category' to int64 so the model can consume it
sig_all_num_cat_robust_trainX['Cluster'] = sig_all_num_cat_robust_trainX['Cluster'].astype(np.int64)
sig_scaled_all_num_cat_robust_test[['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']] = sig_scaled_all_num_cat_robust_test[['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']].astype(np.int64)
sig_scaled_all_num_cat_robust_test[['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']].dtypes
YearBuilt int64 YearRemodAdd int64 GarageYrBlt int64 dtype: object
# After assessment, it turns out the best model for this case uses the all-significant dataset trained with XGBoost.
# So I will use it to predict the test data
xgboost_model.fit(sig_all_num_cat_robust_trainX,scaled_train_y)
salePrice_predict = xgboost_model.predict(sig_scaled_all_num_cat_robust_test).reshape(-1,1)
salePrice_predict
inversed_salePrice_predict = robustScaler_forPrice.inverse_transform(salePrice_predict)
# flatten the (n, 1) array into a plain list
flat_salePrice_predict = inversed_salePrice_predict.ravel().tolist()
#flat_salePrice_predict
#sub_salePrice_prediction = pd.DataFrame({'Id':ameHouse_test['Id'], 'SalePrice':flat_salePrice_predict})
#sub_salePrice_prediction
#sub_salePrice_prediction.to_csv("submission_ames_iowa_house_price_prediction.csv", index=False)